# ACTL3142 Main Assignment (1)

VicRoads: Fatalities on the Roads in Victoria

Analysis Report

ACTL3142 - Statistical Machine Learning for Risk and Actuarial Applications

Jay Jeong, z5416265
## Executive Summary

### Overview

This report aims to inform policies that reduce fatalities on the roads in Victoria by using predictive models to identify the factors associated with fatal crashes.

### Exploratory Data Analysis

Preliminary exploration of the data reveals patterns and correlations. The main purpose of this step was to identify which variables should be examined in greater detail in order to build a relevant predictive model.

### Determining Factors of Fatal Accidents

A generalised linear model (GLM) was used to choose the factors carried forward into the analysis.

### Predictive Model for Prevention

Building on the insights from the GLM, we developed a predictive model to identify the drivers most at risk of being involved in a fatal accident. Using demographic information and vehicle characteristics, the model targets the 2,500 drivers, out of a set of 10,000, who are most likely to be involved in a fatal accident.

### Intervention Strategies for Fatal Crashes

Based on the variables that contribute to fatal accidents, we explore preventative strategies that VicRoads could adopt to reduce road fatalities.
## Section 1: Exploratory Data Analysis

Initial analysis was carried out with simple ggplot charts, since most variables in the dataset are categorical rather than numerical. Statistical analysis, however, first required undersampling, because R could not process the full dataset. The 200,000 rows were therefore undersampled to a total of 6,774 rows, as shown in Figure 1.

Figure 1: Undersampled Dataset

Comparing the variables that the GLM flagged as important against the plots suggests that, in general, the less dominant levels of a variable were the more significant ones. This was especially the case for SURFACE_COND and VEHICLE_TYPE, where the most common levels, "Dry" and "Cars", were not significant.

In terms of gender distribution, the plots showed a predominance of males, which could imply that males are more prone to accidents. The seatbelt data showed that many drivers were not wearing a seatbelt. This is a point of concern, as seatbelt usage is known to be crucial in preventing fatalities in the event of an accident. The data also showed a wide variety of vehicle types involved in accidents, although these do not seem to have much relevance to fatalities.

Accidents are fairly evenly distributed across the days of the week, with a slight increase on weekends and on Thursdays; the Thursday increase is assumed to be because Thursday is a shopping day. Among accident types, collisions with other vehicles were the most common, which points to a need for more driver education on vehicle spacing and speed management. Most accidents occurred in daylight, which seems to reflect the larger volume of traffic in the daytime. Light conditions therefore probably did not play a major role in the fatalities.
This EDA therefore helps to identify which variables are likely to be valuable when building the GLMs and conducting hypothesis tests.
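The undersampling step described in this section can be illustrated with a minimal sketch. The report does not state how the rows were selected, so the approach below (randomly discarding rows from the larger outcome class until both classes are the same size) is an assumption, as are the column names `FATAL` and `SEX` and the toy data. The report's own analysis was carried out in R; Python is used here only for illustration.

```python
import random
from collections import Counter, defaultdict


def undersample(rows, label_key, seed=0):
    """Randomly drop rows from larger classes until every class
    has as many rows as the smallest class (assumed approach)."""
    rng = random.Random(seed)

    # Group the rows by their outcome label.
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[label_key]].append(row)

    # Every class is cut down to the size of the smallest one.
    n_min = min(len(group) for group in by_class.values())

    sample = []
    for group in by_class.values():
        sample.extend(rng.sample(group, n_min))

    # Shuffle so the classes are not stored in contiguous blocks.
    rng.shuffle(sample)
    return sample


# Hypothetical toy data: 1,000 crash records, of which 50 are fatal.
toy_rng = random.Random(1)
toy_rows = [
    {"FATAL": i < 50, "SEX": toy_rng.choice(["M", "F"])}
    for i in range(1000)
]

balanced = undersample(toy_rows, "FATAL")
class_counts = Counter(row["FATAL"] for row in balanced)
print(len(balanced), dict(class_counts))
```

Because undersampling changes the share of fatal crashes in the data, probabilities predicted by a model fitted on the balanced sample would not match the fatality rate in the full dataset without further adjustment.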