FISH HOMEWORK

HW 2 Peer Assessment

Background

The fishing industry uses numerous measurements to describe a specific fish. Our goal is to predict the weight of a fish based on a number of these measurements and determine if any of these measurements are insignificant in determining the weight of a product. See below for the description of these measurements.

Data Description

The data consists of the following variables:

1. Weight: weight of fish in g (numerical)
2. Species: species name of fish (categorical)
3. Body.Height: height of body of fish in cm (numerical)
4. Total.Length: length of fish from mouth to tail in cm (numerical)
5. Diagonal.Length: length of diagonal of main body of fish in cm (numerical)
6. Height: height of head of fish in cm (numerical)
7. Width: width of head of fish in cm (numerical)

Read the data

# Import library you may need
library(car)

## Warning: package 'car' was built under R version 4.2.3
## Loading required package: carData

# Read the data set
fishfull = read.csv("Fish.csv", header=T, fileEncoding = 'UTF-8-BOM')
row.cnt = nrow(fishfull)
# Split the data into training and testing sets
fishtest = fishfull[(row.cnt-9):row.cnt,]
fish = fishfull[1:(row.cnt-10),]

Please use fish as your data set for the following questions unless otherwise stated.

Question 1: Exploratory Data Analysis [8 points]

(a) Create a box plot comparing the response variable, Weight, across the multiple species. Based on this box plot, does there appear to be a relationship between the predictor and the response?

boxplot(Weight~Species, main="Fish Weight by Species", xlab="Species", ylab="Weight", col=blues9, data=fish)

[Figure: box plot "Fish Weight by Species"; x-axis: Species (Bream, Parkki, Perch, Pike, Roach, Smelt, Whitefish); y-axis: Weight]

Based on the above box plot comparing Weight across the multiple fish species, there does seem to be a relationship between the species of the fish and weight.
The Roach, Parkki and Smelt species seem to have much lower weight and also less variability in weight than the other species.

(b) Create scatterplots of the response, Weight, against each quantitative predictor, namely Body Height, Total Length, Diagonal Length, Height, and Width. Describe the general trend of each plot. Are there any potential outliers?

par(mfrow = c(1,1))
plot(fish[-2])

[Figure: scatterplot matrix of Weight, Body.Height, Total.Length, Diagonal.Length, Height, and Width]

Based on the scatterplots of weight against the various measurements, there do seem to be linear relationships between weight and each measurement. There also seems to be a strong linear relationship between Body Height and Total Length as well as Diagonal Length. There is a relationship between Body Height and head Height and Width as well; however, it is not as strong as the previously stated relationship. Total Length and Diagonal Length also seem to have a strong linear relationship, and Diagonal Length, Height, and Width all have fairly strong linear relationships with one another. There does seem to be a possible outlier associated with a Roach fish at a total weight of 1700 g but very low measurements for all the predicting variables.

(c) Display the correlations between each of the quantitative variables. Interpret the correlations in the context of the relationships of the predictors to the response and in the context of multicollinearity.

cor(fish[,3:7])

##                 Body.Height Total.Length Diagonal.Length    Height     Width
## Body.Height       1.0000000    0.9995134       0.9919502 0.6265604 0.3661852
## Total.Length      0.9995132    1.0000000       0.9940896 0.6422261 0.3728030
## Diagonal.Length   0.9919502    0.9940896       1.0000000 0.7052116 0.5770361
## Height            0.6268604    0.6422261       0.7052116 1.0000000 0.7905491
## Width             0.5661882    0.5725030       0.5770361 0.7998491 1.0000000
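Reading a correlation matrix by eye gets tedious as the number of predictors grows. The following is a minimal sketch of one way to list the strongly correlated pairs programmatically. The helper name high_cor_pairs, the 0.9 cutoff, and the synthetic toy data frame are all assumptions made for illustration; none of them come from the assignment.

```r
# Hypothetical helper (not part of the assignment): list each pair of columns
# whose absolute correlation exceeds a cutoff.
high_cor_pairs <- function(df, cutoff = 0.9) {
  cm <- cor(df)
  # Keep only the upper triangle so each pair is reported once
  idx <- which(upper.tri(cm) & abs(cm) > cutoff, arr.ind = TRUE)
  data.frame(
    var1 = rownames(cm)[idx[, "row"]],
    var2 = colnames(cm)[idx[, "col"]],
    r    = cm[idx],
    stringsAsFactors = FALSE
  )
}

# Synthetic stand-in data, made up purely for illustration
set.seed(1)
x <- rnorm(50)
toy <- data.frame(a = x, b = x + rnorm(50, sd = 0.05), c = rnorm(50))
pairs_found <- high_cor_pairs(toy)
print(pairs_found)
```

With the homework data, the equivalent call would presumably be high_cor_pairs(fish[, 3:7]), using the same columns passed to cor() above.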
The maximum correlation between predicting variables is 0.999, which is an extremely strong correlation. All correlations are greater than 0.64 and almost all are greater than 0.85, which shows extremely strong correlations among all the predictors. This indicates that many of the predictors will have strong correlations with the response and also hints that there may be some multicollinearity among the predictors.

(d) Based on this exploratory analysis, is it reasonable to assume a multiple linear regression model for the relationship between Weight and the predictor variables?

Based on the exploratory analysis, it can be reasonably assumed that a multiple linear regression model is suitable for the relationship between Weight and the predictor variables.

Question 2: Fitting the Multiple Linear Regression Model [8 points]

Create the full model without transforming the response variable or predicting variables using the fish data set. Do not use fishtest.

(a) Build a multiple linear regression model, called model1, using the response and all predictors. Display the summary table of the model.

model1 <- lm(Weight~., fish)
summary(model1)

##
## Call:
## lm(formula = Weight ~ ., data = fish)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -214.04  -51.62  -15.07   34.27  434.64
##
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)
## (Intercept)      -976.375    130.951  -7.456 9.18e-12 ***
## SpeciesParkki     200.426     79.594   2.515 0.012950 *
## SpeciesPerch      181.188    123.705   1.465 0.145299
## SpeciesPike      -184.107    139.774  -1.317 0.189979
## SpeciesRoach      139.655     94.066   1.485 0.139935
## SpeciesSmelt      491.152    122.902   3.996 0.000105 ***
## SpeciesWhitefish  126.262     98.907   1.277 0.203911
## Body.Height       -72.057     36.802  -1.958 0.052265 .
## Total.Length       66.686     46.631   1.430 0.154971
## Diagonal.Length    38.879     29.678   1.310 0.192387
## Height              8.802     13.247   0.664 0.507529
## Width              -9.945     24.311  -0.409 0.683010
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 93.65 on 137 degrees of freedom
## Multiple R-squared: 0.9387, Adjusted R-squared: 0.9338
## F-statistic: 190.9 on 11 and 137 DF, p-value: < 2.2e-16

(b) Is the overall regression significant at an α level of 0.01? Explain.

With an F-statistic of 190.9 on 11 and 137 degrees of freedom, which corresponds to a p-value below 2.2e-16 and therefore below α = 0.01, the overall regression is significant, indicating that at least one of the predicting variables has explanatory power on the weight of fish.

(c) What is the coefficient estimate for Body.Height? Interpret this coefficient.

The coefficient estimate for Body.Height is -72.057. This can be interpreted as the expected weight of a fish decreasing by 72.057 g for each 1 cm increase in Body.Height, while holding all other predictors fixed.

(d) What is the coefficient estimate for the Species category Parkki? Interpret this coefficient.

The coefficient estimate for the Species category Parkki is 200.426. This can be interpreted as the expected weight of a Parkki being 200.426 g higher than that of a fish of the baseline species (Bream, the level with no coefficient in the table), while holding all other predictors fixed.

Question 3: Checking for Outliers and Multicollinearity [6 points]

(a) Create a plot for the Cook's Distances. Using a threshold Cook's Distance of 1, identify the row numbers of any outliers.

cook <- cooks.distance(model1)
plot(cook, type="h", lwd=2, col="orange", ylab="Cook's Distance")

[Figure: Cook's Distance plotted against observation Index (0 to 150)]

cook[(cook>1)]

## named numeric(0)

Using a threshold Cook's Distance of 1, the above plot shows 1 outlier at row 30, which has a Cook's Distance of 1.47.

(b) Remove the outlier(s) from the data set and create a new model, called model2, using all predictors with Weight as the response. Display the summary of this model.

fish.red <- fish[-30,]
model2 <- lm(Weight~., fish.red)
summary(model2)

##
## Call:
## lm(formula = Weight ~ ., data = fish.red)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -211.10  -50.18  -14.44   34.04  433.68
##
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)
## (Intercept)      -969.766    131.601  -7.369 1.51e-11 ***
## SpeciesParkki     195.500     80.105   2.441 0.015051 *
## SpeciesPerch      174.201    124.404   1.401 0.163608
## SpeciesPike      -175.936    140.605  -1.251 0.212983
## SpeciesRoach      141.87      94.319   1.504 0.134871
## SpeciesSmelt      489.714    123.174   3.976 0.000113 ***
## SpeciesWhitefish  122.277     99.203   1.231 0.220270
## Body.Height       -76.321     37.437  -2.039 0.043422 *
## Total.Length       74.822     48.319   1.549 0.123825
## Diagonal.Length    34.389     30.518   1.126 0.262350
## Height             10.000     13.395   0.746 0.456692
## Width              -8.3       24.483  -0.341 0.733924
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 93.34 on 136 degrees of freedom
## Multiple R-squared: 0.9385, Adjusted R-squared: 0.9335
## F-statistic: 188.6 on 11 and 136 DF, p-value: < 2.2e-16

(c) Display the VIF of each predictor for model2. Using a VIF threshold of max(10, 1/(1-R^2)), what conclusions can you draw?

vif(model2)

##                       GVIF Df GVIF^(1/(2*Df))
## Species         1545.55017  6        1.843983
## Body.Height     2371.15420  1       48.694499
## Total.Length    4540.47695  1       67.383062
## Diagonal.Length 2126.64985  1       46.115614
## Height            56.21375  1        7.497583
## Width             29.01683  1        5.386727

# Calculate 1/(1-R^2) value for VIF threshold
rsq <- 0.9385
vif_thresh <- 1/(1-rsq)
vif_thresh

## [1] 16.26016

The VIF threshold is max(10, 16.26) = 16.26. Based on the threshold and the calculated VIF values, there seems to be strong multicollinearity among the predictors. Species, Body Height, Total Length and Diagonal Length specifically have extremely high VIFs, meaning they must be highly correlated with other predictors.

Question 4: Checking Model Assumptions [6 points]

Please use the cleaned data set, which has the outlier(s) removed, and model2 for answering the following questions.

(a) Create scatterplots of the standardized residuals of model2 versus each quantitative predictor. Does the linearity assumption appear to hold for all predictors?
res <- residuals(model2)
par(mfrow=c(2,3))
plot(fish.red$Body.Height, res, xlab="Body Height", ylab="Residuals"); abline(h=0)
plot(fish.red$Total.Length, res, xlab="Total Length", ylab="Residuals"); abline(h=0)
plot(fish.red$Diagonal.Length, res, xlab="Diagonal Length", ylab="Residuals"); abline(h=0)
plot(fish.red$Height, res, xlab="Height", ylab="Residuals"); abline(h=0)
plot(fish.red$Width, res, xlab="Width", ylab="Residuals"); abline(h=0)

[Figure: residuals plotted against Body Height, Total Length, Diagonal Length, Height, and Width]

Based on the scatterplots of residuals vs. predictors, there is a fairly weak random scatter around the zero line. The plots of Height and Width have a stronger random scatter; however, in all the plots there seems to be a greater number of residuals located below the zero line.

(b) Create a scatter plot of the standardized residuals of model2 versus the fitted values of model2. Does the constant variance assumption appear to hold? Do the errors appear uncorrelated?

fit <- model2$fitted.values
plot(res, fit, xlab="Residuals", ylab="Fitted Values"); abline(h=0)

[Figure: scatter plot of Fitted Values (y-axis) against Residuals (x-axis, -200 to 400)]

The scatter plot of fitted values vs. residuals does not appear to have a random scatter around zero, indicating that the constant variance assumption does not hold. The scatter also seems to show somewhat of a megaphone effect. The errors appear to be uncorrelated, as no clusters seem to be formed in the residuals plot.

(c) Create a histogram and normal QQ plot for the standardized residuals. What conclusions can you draw from these plots?

par(mfrow=c(1,2))
hist(res, xlab="Residuals", main="Histogram of Residuals")
qqnorm(res); qqline(res)

[Figure: "Histogram of Residuals" and "Normal Q-Q Plot"]
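The questions in this section ask for standardized residuals, while the code above plots the raw residuals returned by residuals(model2), and in part (b) it places the residuals on the x-axis, so abline(h=0) does not mark the zero-residual line. The sketch below shows one conventional alternative using rstandard(). It is only an illustration: the built-in mtcars data and the model toy_fit are stand-ins assumed here so that the example runs without Fish.csv, and they are not part of the assignment.

```r
# Illustrative stand-in: the built-in mtcars data replaces fish.red and
# toy_fit replaces model2 (both substitutions are assumptions of this sketch).
toy_fit <- lm(mpg ~ wt + hp, data = mtcars)

std_res <- rstandard(toy_fit)  # internally standardized residuals
fit_val <- fitted(toy_fit)     # fitted values of the same model

par(mfrow = c(1, 3))
# Residuals go on the y-axis so that abline(h = 0) marks the zero-residual line
plot(fit_val, std_res, xlab = "Fitted Values", ylab = "Standardized Residuals")
abline(h = 0)
hist(std_res, xlab = "Standardized Residuals",
     main = "Histogram of Standardized Residuals")
qqnorm(std_res); qqline(std_res)
```

For the homework itself, the same calls would presumably be applied as rstandard(model2) and fitted(model2).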