Section 1: Exploratory Data Analysis
Initial analysis could be carried out with simple ggplots, as most of the dataset was made up
of categorical rather than numerical variables. However, before any statistical analysis could
be run, the dataset had to be undersampled because R could not process it in full. Thus, 200,000
rows were undersampled to a total of 6,774 rows of data, as shown in Figure 1.
Figure 1: Undersampled Dataset
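The report does not state which undersampling scheme was used, so the following is only a minimal sketch of one common approach: random undersampling of the majority class so that both outcome classes end up the same size. The data are synthetic, and the column name FATAL and the helper undersample() are hypothetical names introduced for illustration.

```r
# Synthetic stand-in for the accident data (not the real dataset).
# FATAL is a hypothetical binary outcome column: 1 = fatal, 0 = non-fatal.
set.seed(42)
n <- 2000
accidents <- data.frame(
  FATAL = rbinom(n, 1, 0.05),
  SURFACE_COND = sample(c("Dry", "Wet", "Icy"), n, replace = TRUE,
                        prob = c(0.70, 0.25, 0.05))
)

# Random undersampling: keep every row of the smallest class and draw an
# equally sized random sample from each larger class.
undersample <- function(df, outcome) {
  n_min <- min(table(df[[outcome]]))
  groups <- split(seq_len(nrow(df)), df[[outcome]])
  keep <- unlist(lapply(groups, function(idx) {
    if (length(idx) > n_min) sample(idx, n_min) else idx
  }), use.names = FALSE)
  df[sort(keep), , drop = FALSE]
}

balanced <- undersample(accidents, "FATAL")
print(table(accidents$FATAL))  # class counts before undersampling
print(table(balanced$FATAL))   # class counts after undersampling
```

Fixing the seed makes the reduced dataset reproducible, which matters when later model results depend on which rows were kept.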
Comparing the variables that the GLM identified as important with the ggplots suggests that,
within a variable, the less dominant levels were usually the more significant ones.
This was especially the case for SURFACE_COND and VEHICLE_TYPE,
where the most common levels, "Dry" and "Cars", were not significant.
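As a rough illustration of how per-level significance can be set against level frequency, the sketch below fits a binomial GLM to synthetic data. The outcome column FATAL, the level names other than "Dry", and the simulated effect are all assumptions made for the example; the report's own model specification is not given. Note that with R's default treatment coding the reference level has no coefficient of its own, so a less common level is chosen as the reference here in order for "Dry" to receive a p-value.

```r
# Synthetic data: the effect sizes below are invented for illustration only.
set.seed(1)
n <- 3000
surface <- factor(sample(c("Dry", "Wet", "Icy"), n, replace = TRUE,
                         prob = c(0.7, 0.2, 0.1)))
# Only the rare level "Icy" is given a raised fatality probability.
p_fatal <- plogis(-2 + 1.2 * (surface == "Icy"))
crashes <- data.frame(FATAL = rbinom(n, 1, p_fatal), SURFACE_COND = surface)

# Make "Wet" the reference level so that "Dry" gets its own coefficient.
crashes$SURFACE_COND <- relevel(crashes$SURFACE_COND, ref = "Wet")

fit <- glm(FATAL ~ SURFACE_COND, data = crashes, family = binomial)

# Per-level estimates and p-values (each relative to the reference level).
coefs <- summary(fit)$coefficients
print(round(coefs, 3))

# Share of rows in each level, to compare dominance against significance.
level_share <- prop.table(table(crashes$SURFACE_COND))
print(round(level_share, 3))
```

Because each p-value is a comparison with the reference level, the choice of reference affects which levels appear significant and is worth reporting alongside the results.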
Overall, in terms of gender distribution, the plots showed a predominance of males, which
could imply that males are more prone to accidents. The plots of seatbelt use
showed that many drivers were not wearing a seatbelt. This should be a point of concern,
as seatbelt usage is known to be crucial in preventing fatalities in the event of accidents. The
data also showed a wide variety of vehicle types involved in accidents. However, vehicle type does not
seem to have much relevance to fatalities. Analysis of the day of the week showed that
accidents are fairly evenly distributed across days, with a slight increase during the weekend
and on Thursdays, which is assumed to be because Thursday is a shopping day. Furthermore, when
exploring accident types, collisions with other vehicles were the most common. This suggests
a need for increased driver education around vehicle spacing and speed
management. Looking at light conditions, most accidents occurred during the day, which
most likely reflects the larger volume of traffic in the daytime. Therefore, light
conditions probably did not play a major role in fatalities. This EDA therefore helps to
identify which variables might be valuable when creating the GLMs and conducting hypothesis
testing.
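The light-condition argument above rests on the difference between raw accident counts and fatality rates. A minimal sketch of that comparison is given below on synthetic data; the variable names, the level names and the 5% fatality rate are assumptions for illustration, and the chi-squared test is shown only as one possible follow-up test, not as the test used in this report.

```r
# Synthetic data in which the fatality rate is identical for every light
# condition by construction, while daytime accidents dominate the counts.
set.seed(7)
n <- 5000
light <- factor(sample(c("Day", "Dusk/Dawn", "Dark"), n, replace = TRUE,
                       prob = c(0.7, 0.1, 0.2)))
fatal <- rbinom(n, 1, 0.05)

# Raw counts: rows are light conditions, columns are non-fatal (0) / fatal (1).
counts <- table(LIGHT = light, FATAL = fatal)
print(counts)

# Fatality rate within each light condition (row-wise proportions).
fatal_rate <- prop.table(counts, margin = 1)[, "1"]
print(round(fatal_rate, 3))

# Chi-squared test of independence between light condition and fatality.
test <- chisq.test(counts)
print(test$p.value)
```

A level can account for most accidents simply because it accounts for most driving, so the within-level rate is the quantity to carry forward into the GLMs and hypothesis tests.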