Homework 4 Peer Assessment
Summer Semester 2023

Background

Selected molecular descriptors from the Dragon chemoinformatics application were used to predict bioconcentration factors for 779 chemicals in order to evaluate QSAR (Quantitative Structure-Activity Relationship). This dataset was obtained from the UCI Machine Learning Repository. The dataset consists of 779 observations of 10 attributes. Below is a brief description of each feature and the response variable (logBCF) in our dataset:

1. nHM - number of heavy atoms (integer)
2. piPC09 - molecular multiple path count (numeric)
3. PCD - difference between multiple path count and path count (numeric)
4. X2Av - average valence connectivity (numeric)
5. MLOGP - Moriguchi octanol-water partition coefficient (numeric)
6. ON1V - overall modified Zagreb index by valence vertex degrees (numeric)
7. N.072 - frequency of RCO-N< / >N-X=X fragments (integer)
8. B02[C-N] - presence/absence of C-N atom pairs (binary)
9. F04[C-O] - frequency of C-O atom pairs (integer)
10. logBCF - Bioconcentration Factor in log units (numeric)

Note that all predictors, with the exception of B02[C-N], are quantitative. For the purpose of this assignment, DO NOT CONVERT B02[C-N] to a factor. Leave the data in its original format - numeric in R.

Please load the dataset "Bio_pred" and then split the dataset into a train and test set in an 80:20 ratio. Use the training set to build the models in Questions 1-6. Use the test set to help evaluate model performance in Question 7.

Please make sure that you are using R version 3.6.X or above (i.e. version 4.X is also acceptable).

Read Data

# Clear variables in memory
rm(list = ls())

# Import the libraries
library(CombMSC)
library(boot)
library(leaps)
library(MASS)
library(glmnet)

# Ensure that the sampling type is correct
RNGkind(sample.kind = "Rejection")

# Set a seed for reproducibility
set.seed(100)

# Read data
fullData = read.csv("Bio_pred.csv", header = TRUE)

# Split data for training and testing
testRows = sample(nrow(fullData), 0.2 * nrow(fullData))
testData = fullData[testRows, ]
trainData = fullData[-testRows, ]

Note: Use the training set to build the models in Questions 1-6. Use the test set to help evaluate model performance in Question 7.

Question 1: Full Model

(a) Fit a multiple linear regression with the variable logBCF as the response and the other variables as predictors. Call it model1. Display the model summary.

model1 = lm(logBCF ~ ., data = trainData)
summary(model1)

##
## Call:
## lm(formula = logBCF ~ ., data = trainData)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -3.2577 -0.5180  0.0448  0.5117  4.0423
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept)  0.001422   0.138057   0.010  0.99179
## nHM          0.137022   0.022462   6.100 1.88e-09 ***
## piPC09       0.031158   0.020874   1.493  0.13603
## PCD          0.055655   0.063874   0.871  0.38391
## X2Av        -0.031890   0.253574  -0.126  0.89996
## MLOGP        0.506088   0.034211  14.793  < 2e-16 ***
## ON1V         0.140595   0.066810   2.104  0.03575 *
## N.072       -0.073334   0.070993  -1.033  0.30202
## B02.C.N.    -0.158231   0.080143  -1.974  0.04879 *
## F04.C.O.    -0.030763   0.009667  -3.182  0.00154 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7957 on 614 degrees of freedom
## Multiple R-squared:  0.6672, Adjusted R-squared:  0.6623
## F-statistic: 136.8 on 9 and 614 DF,  p-value: < 2.2e-16

(b) Which regression coefficients are significant at the 95% confidence level? At the 99% confidence level?
names(which(summary(model1)$coefficients[, 4] < 0.05))

## [1] "nHM"      "MLOGP"    "ON1V"     "B02.C.N." "F04.C.O."

The coefficients significant at the 95% confidence level are: nHM, MLOGP, ON1V, B02.C.N., and F04.C.O.

names(which(summary(model1)$coefficients[, 4] < 0.01))

## [1] "nHM"      "MLOGP"    "F04.C.O."

The coefficients significant at the 99% confidence level are: nHM, MLOGP, and F04.C.O.

(c) What are the Mallows' Cp, AIC, and BIC criterion values for this model?

set.seed(100)
n = nrow(trainData)
Mallow_val = Cp(model1, S2 = summary(model1)$sigma^2)
AIC_val = AIC(model1, k = 2)
BIC_val = AIC(model1, k = log(n))
cat("Mallow_val:", Mallow_val)

## Mallow_val: 10

cat("\n")
cat("AIC_val:", AIC_val)

## AIC_val: 1497.477

cat("\n")
cat("BIC_val:", BIC_val)

## BIC_val: 1546.274

(d) Build a new model on the training data with only the variables whose coefficients were found to be statistically significant at the 99% confidence level. Call it model2. Perform a partial F-test to compare this new model with the full model (model1). Which one would you prefer? Is it good practice to select variables based on statistical significance of individual coefficients? Explain.

Null hypothesis: the additional predictors in the full model do not contribute significantly to the model. We pick a significance level of 0.05. The p-value of the partial F-test is 0.00523, which is much smaller than the significance level, so we reject the null hypothesis. This suggests that the additional predictors in the full model do contribute significantly, and I would therefore prefer model1. It is not good practice to select variables based on the statistical significance of individual coefficients: as the comparison of model1 and model2 shows, predictors that are jointly useful can be excluded when variables are chosen solely on the basis of their individual p-values.
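The model2 fit and partial F-test discussed above are not shown in this preview. A minimal sketch of what that code presumably looks like, assuming model2 uses the three predictors found significant at the 99% level (nHM, MLOGP, F04.C.O.):

# Reduced model with only the 99%-significant predictors (reconstructed sketch)
model2 = lm(logBCF ~ nHM + MLOGP + F04.C.O., data = trainData)
summary(model2)

# Partial F-test: anova() on the nested pair reports the F statistic and
# p-value for jointly dropping the other six predictors from model1
anova(model2, model1)

With this seed and split, the quoted p-value of 0.00523 would correspond to the Pr(>F) entry in the anova() output.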