School: Georgia Institute of Technology
Course: ISYE 6414
Subject: Statistics
Date: Aug 27, 2023
Pages: 12
Uploaded by KidIronPuppy25 on coursehero.com


Homework 4 Peer Assessment
Summer Semester 2023
Background
Selected molecular descriptors from the Dragon chemoinformatics application were used to predict bioconcentration factors for 779 chemicals in order to evaluate QSAR (Quantitative Structure-Activity Relationship). This dataset was obtained from the UCI Machine Learning Repository.
The dataset consists of 779 observations of 10 attributes. Below is a brief description of each feature and
the response variable (logBCF) in our dataset:
1. nHM - number of heavy atoms (integer)
2. piPC09 - molecular multiple path count (numeric)
3. PCD - difference between multiple path count and path count (numeric)
4. X2Av - average valence connectivity (numeric)
5. MLOGP - Moriguchi octanol-water partition coefficient (numeric)
6. ON1V - overall modified Zagreb index by valence vertex degrees (numeric)
7. N.072 - frequency of RCO-N< / >N-X=X fragments (integer)
8. B02[C-N] - presence/absence of C-N atom pairs (binary)
9. F04[C-O] - frequency of C-O atom pairs (integer)
10. logBCF - Bioconcentration Factor in log units (numeric)
Note that all predictors with the exception of B02[C-N] are quantitative. For the purpose of this assignment,
DO NOT CONVERT B02[C-N] to factor. Leave the data in its original format - numeric in R.
Please load the dataset "Bio_pred" and then split it into train and test sets in an 80:20 ratio. Use the training set to build the models in Questions 1-6. Use the test set to help evaluate model performance in Question 7. Please make sure that you are using R version 3.6.X or above (i.e. version 4.X is also acceptable).
Read Data

# Clear variables in memory
rm(list = ls())

# Import the libraries
library(CombMSC)
library(boot)
library(leaps)
library(MASS)
library(glmnet)

# Ensure that the sampling type is correct
RNGkind(sample.kind = "Rejection")

# Set a seed for reproducibility
set.seed(100)

# Read data
fullData = read.csv("Bio_pred.csv", header = TRUE)

# Split data for training and testing
testRows = sample(nrow(fullData), 0.2 * nrow(fullData))
testData = fullData[testRows, ]
trainData = fullData[-testRows, ]
Note: Use the training set to build the models in Questions 1-6. Use the test set to help evaluate model
performance in Question 7.
Question 1: Full Model
(a) Fit a multiple linear regression with the variable logBCF as the response and the other variables as predictors. Call it model1. Display the model summary.

model1 = lm(logBCF ~ ., data = trainData)
summary(model1)
##
## Call:
## lm(formula = logBCF ~ ., data = trainData)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -3.2577 -0.5180  0.0448  0.5117  4.0423
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept)  0.001422   0.138057   0.010  0.99179
## nHM          0.137022   0.022462   6.100 1.88e-09 ***
## piPC09       0.031158   0.020874   1.493  0.13603
## PCD          0.055655   0.063874   0.871  0.38391
## X2Av        -0.031890   0.253574  -0.126  0.89996
## MLOGP        0.506088   0.034211  14.793  < 2e-16 ***
## ON1V         0.140595   0.066810   2.104  0.03575 *
## N.072       -0.073334   0.070993  -1.033  0.30202
## B02.C.N.    -0.158231   0.080143  -1.974  0.04879 *
## F04.C.O.    -0.030763   0.009667  -3.182  0.00154 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7957 on 614 degrees of freedom
## Multiple R-squared: 0.6672, Adjusted R-squared: 0.6623
## F-statistic: 136.8 on 9 and 614 DF, p-value: < 2.2e-16
(b) Which regression coefficients are significant at the 95% confidence level? At the 99% confidence level?

names(which(summary(model1)$coefficients[, 4] < 0.05))

## [1] "nHM"      "MLOGP"    "ON1V"     "B02.C.N." "F04.C.O."

The coefficients significant at the 95% confidence level are: nHM, MLOGP, ON1V, B02.C.N., and F04.C.O.
names(which(summary(model1)$coefficients[, 4] < 0.01))

## [1] "nHM"      "MLOGP"    "F04.C.O."

The coefficients significant at the 99% confidence level are: nHM, MLOGP, and F04.C.O.
(c) What are the Mallows' Cp, AIC, and BIC criterion values for this model?
set.seed(100)
n = nrow(trainData)
Mallow_val = Cp(model1, S2 = summary(model1)$sigma^2)
AIC_val = AIC(model1, k = 2)
BIC_val = AIC(model1, k = log(n))

cat("Mallow_val:", Mallow_val)

## Mallow_val: 10

cat("\n")
cat("AIC_val:", AIC_val)

## AIC_val: 1497.477

cat("\n")
cat("BIC_val:", BIC_val)

## BIC_val: 1546.274
(d) Build a new model on the training data with only the variables whose coefficients were found to be statistically significant at the 99% confidence level. Call it model2. Perform a partial F-test to compare this new model with the full model (model1). Which one would you prefer? Is it good practice to select variables based on the statistical significance of individual coefficients? Explain.
Null hypothesis: the additional predictors in the full model do not contribute significantly to the model. At a significance level of 0.05, the p-value of the partial F-test is 0.00523, which is well below the significance level, so we reject the null hypothesis. This suggests that the additional predictors in the full model do contribute significantly, and thus I would prefer model1.
It is not good practice to select variables based on the statistical significance of individual coefficients: as the comparison of model1 and model2 shows, predictors that contribute jointly to the model can be excluded when variables are chosen solely on the basis of individual significance.
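The code chunk behind this answer is cut off by the page break, so here is a minimal sketch of the step being described, assuming model2 keeps only the three predictors found significant at the 99% level (nHM, MLOGP, F04.C.O.) and that anova() is used for the partial F-test:

```r
# Reduced model with only the predictors significant at the 99% level
model2 = lm(logBCF ~ nHM + MLOGP + F04.C.O., data = trainData)
summary(model2)

# Partial F-test comparing the reduced model against the full model.
# H0: the coefficients of the six dropped predictors are all zero.
anova(model2, model1)
```

A small p-value in the anova() output (here, 0.00523) rejects H0, indicating the extra predictors in model1 improve the fit jointly.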
