School

Clark University

Course

MSIT MISC

Subject

Statistics

Date

Sep 25, 2023

Uploaded by ConstablePowerDugong23 on coursehero.com

LINEAR REGRESSION - HOMEWORK - GROUP 1
MSDA-3055
7.37. Refer to the CDI data set in Appendix C.2. For predicting the number of active physicians
(Y) in a county, it has been decided to include total population (X1) and total personal income
(X2) as predictor variables. The question now is whether an additional predictor variable
would be helpful in the model and, if so, which variable would be most helpful. Assume that a
first-order multiple regression model is appropriate.
a. For each of the following variables, calculate the coefficient of partial determination given
that X1 and X2 are included in the model: land area (X3), percent of population 65 or older
(X4), number of hospital beds (X5), and total serious crimes (X6).
R-code:
cdi <- read.csv("C:\\Users\\akhil\\OneDrive\\Documents\\cdi.csv")
# Fit the full first-order model with all six predictors (X1-X6).
# This object is not used in the calculations below.
model <- lm(Number.of.active..physicians ~ Total.population + Total.personal.income +
Land.area + Percent.of.population.65.or.older. + Number.of.hospital.beds + Total.serious.crimes,
data = cdi)
# Fit the initial model with X1 and X2 as predictor variables
model_initial <- lm(Number.of.active..physicians ~ Total.population + Total.personal.income,
data = cdi)
# Increase in R^2 when each candidate variable is added to the (X1, X2) model.
# Note: this difference equals SSR(Xk | X1, X2) / SSTO. The coefficient of partial
# determination is SSR(Xk | X1, X2) / SSE(X1, X2), i.e. this difference divided by
# 1 - summary(model_initial)$r.squared. The ranking of the four variables is the same.
partial_determination_X3 <- summary(lm(Number.of.active..physicians ~ Total.population +
Total.personal.income + Land.area, data = cdi))$r.squared -
summary(model_initial)$r.squared
partial_determination_X4 <- summary(lm(Number.of.active..physicians ~ Total.population +
Total.personal.income + Percent.of.population.65.or.older., data = cdi))$r.squared -
summary(model_initial)$r.squared
partial_determination_X5 <- summary(lm(Number.of.active..physicians ~ Total.population +
Total.personal.income + Number.of.hospital.beds, data = cdi))$r.squared -
summary(model_initial)$r.squared
partial_determination_X6 <- summary(lm(Number.of.active..physicians ~ Total.population +
Total.personal.income + Total.serious.crimes, data = cdi))$r.squared -
summary(model_initial)$r.squared
# Print the results
partial_determination_X3
partial_determination_X4
partial_determination_X5
partial_determination_X6
OUTPUT:
> cdi <- read.csv("C:\\Users\\akhil\\OneDrive\\Documents\\cdi.csv")
> # Fit the multiple regression model
> model <- lm(Number.of.active..physicians ~ Total.population + Total.personal.income +
Land.area + Percent.of.population.65.or.older. + Number.of.hospital.beds +
Total.serious.crimes, data = cdi)
> # Fit the initial model with X1 and X2 as predictor variables
> model_initial <- lm(Number.of.active..physicians ~ Total.population +
Total.personal.income, data = cdi)
> # Calculating the coefficient of partial determination for each additional variable
> partial_determination_X3 <- summary(lm(Number.of.active..physicians ~ Total.population
+ Total.personal.income + Land.area, data = cdi))$r.squared -
summary(model_initial)$r.squared
> partial_determination_X4 <- summary(lm(Number.of.active..physicians ~ Total.population
+ Total.personal.income + Percent.of.population.65.or.older., data = cdi))$r.squared -
summary(model_initial)$r.squared
> partial_determination_X5 <- summary(lm(Number.of.active..physicians ~ Total.population
+ Total.personal.income + Number.of.hospital.beds, data = cdi))$r.squared -
summary(model_initial)$r.squared
> partial_determination_X6 <- summary(lm(Number.of.active..physicians ~ Total.population
+ Total.personal.income + Total.serious.crimes, data = cdi))$r.squared -
summary(model_initial)$r.squared
> # Print the results
> partial_determination_X3
[1] 0.002889597
> partial_determination_X4
[1] 0.0003851834
> partial_determination_X5
[1] 0.05551826
> partial_determination_X6
[1] 0.0007341451
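The values printed above are differences in R^2, that is SSR(Xk | X1, X2) / SSTO. The coefficient of partial determination uses SSE(X1, X2) as the denominator instead: R^2(Y k | 1 2) = SSR(Xk | X1, X2) / SSE(X1, X2). A minimal sketch of that calculation follows. The helper name partial_r2 is hypothetical, and the built-in mtcars data stands in for the CDI file so that the block runs on its own; the numbers it produces are not CDI results.

```r
# Hypothetical helper (not part of the original homework): coefficient of
# partial determination for the predictors in `full` that are not in `reduced`.
# Both arguments are nested lm fits on the same data.
partial_r2 <- function(reduced, full) {
  sse_reduced <- deviance(reduced)  # SSE of the smaller model
  sse_full <- deviance(full)        # SSE of the larger model
  (sse_reduced - sse_full) / sse_reduced
}

# Illustration on mtcars (an assumption, used only so the code is runnable).
reduced <- lm(mpg ~ wt + hp, data = mtcars)
full <- lm(mpg ~ wt + hp + qsec, data = mtcars)
pr2 <- partial_r2(reduced, full)

# The same quantity from the two R^2 values: (R2_full - R2_reduced) / (1 - R2_reduced)
r2_reduced <- summary(reduced)$r.squared
r2_full <- summary(full)$r.squared
pr2_from_r2 <- (r2_full - r2_reduced) / (1 - r2_reduced)

# Both routes agree up to floating-point error
stopifnot(isTRUE(all.equal(pr2, pr2_from_r2)))
pr2
```

For the CDI data, the same helper could be called as partial_r2(model_initial, lm(... + Land.area, data = cdi)), and likewise for the other three candidates.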
b. On the basis of the results in part (a), which of the four additional predictor variables is
best? Is the extra sum of squares associated with this variable larger than those for the other
three variables?
R-CODE:
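# Extract one sum of squares per three-predictor model.
# Note: row 4 of each sequential ANOVA table is the Residuals row, so the values
# extracted below are SSE(X1, X2, Xk). The extra sum of squares SSR(Xk | X1, X2)
# is row 3 of the same table.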
extra_sumsquares_X3 <- anova(lm(Number.of.active..physicians ~ Total.population +
Total.personal.income + Land.area, data = cdi))$"Sum Sq"[4]
extra_sumsquares_X4 <- anova(lm(Number.of.active..physicians ~ Total.population +
Total.personal.income + Percent.of.population.65.or.older., data = cdi))$"Sum Sq"[4]
extra_sumsquares_X5 <- anova(lm(Number.of.active..physicians ~ Total.population +
Total.personal.income + Number.of.hospital.beds, data = cdi))$"Sum Sq"[4]
extra_sumsquares_X6 <- anova(lm(Number.of.active..physicians ~ Total.population +
Total.personal.income + Total.serious.crimes, data = cdi))$"Sum Sq"[4]
# Compare the extra sum of squares
extra_sumsquares <- c(extra_sumsquares_X3, extra_sumsquares_X4, extra_sumsquares_X5,
extra_sumsquares_X6)
best_variable <- which.max(extra_sumsquares)
# Print the results
extra_sumsquares
best_variable
OUTPUT:
> # Calculating the extra sum of squares for each additional variable
> extra_sumsquares_X3 <- anova(lm(Number.of.active..physicians ~ Total.population +
Total.personal.income + Land.area, data = cdi))$"Sum Sq"[4]
> extra_sumsquares_X4 <- anova(lm(Number.of.active..physicians ~ Total.population +
Total.personal.income + Percent.of.population.65.or.older., data = cdi))$"Sum Sq"[4]
> extra_sumsquares_X5 <- anova(lm(Number.of.active..physicians ~ Total.population +
Total.personal.income + Number.of.hospital.beds, data = cdi))$"Sum Sq"[4]
> extra_sumsquares_X6 <- anova(lm(Number.of.active..physicians ~ Total.population +
Total.personal.income + Total.serious.crimes, data = cdi))$"Sum Sq"[4]
> # Compare the extra sum of squares
> extra_sumsquares <- c(extra_sumsquares_X3, extra_sumsquares_X4,
extra_sumsquares_X5, extra_sumsquares_X6)
> best_variable <- which.max(extra_sumsquares)
> # Print the results
> extra_sumsquares
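[1] 136903711 140425434  62896949 139934722
> best_variable
[1] 2
Note: index [4] selects the Residuals row of each ANOVA table, so these values are the error sums of squares SSE(X1, X2, Xk), not the extra sums of squares. A smaller SSE means a larger extra sum of squares. X5 (number of hospital beds) has the smallest SSE, 62896949, and therefore the largest extra sum of squares, in agreement with part (a). which.max returns position 2 (X4) only because it was applied to SSE values.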
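The extra sum of squares can also be computed directly as SSR(Xk | X1, X2) = SSE(X1, X2) - SSE(X1, X2, Xk). The sketch below shows one way to do this. It uses the built-in mtcars data rather than the CDI file, and the candidate variable names are assumptions chosen only so that the block is runnable.

```r
# Reduced model with two predictors already included (stand-in for X1, X2)
reduced <- lm(mpg ~ wt + hp, data = mtcars)

# Hypothetical candidate variables (stand-ins for X3-X6)
candidates <- c("qsec", "drat", "disp")

# Extra sum of squares for each candidate: SSE(reduced) - SSE(full)
extra_ss <- sapply(candidates, function(v) {
  full <- lm(reformulate(c("wt", "hp", v), response = "mpg"), data = mtcars)
  deviance(reduced) - deviance(full)
})

# Cross-check one candidate against row 3 of the sequential ANOVA table,
# which holds SSR(qsec | wt, hp) because qsec enters last
full_qsec <- lm(mpg ~ wt + hp + qsec, data = mtcars)
stopifnot(isTRUE(all.equal(unname(extra_ss["qsec"]), anova(full_qsec)$"Sum Sq"[3])))

# The most helpful candidate has the largest extra sum of squares
best_variable <- names(which.max(extra_ss))
extra_ss
best_variable
```

Because SSE(X1, X2) is the same for every candidate, ranking by the largest extra sum of squares is equivalent to ranking by the smallest SSE(X1, X2, Xk).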
c. Using the F* test statistic, test whether or not the variable determined to be best in part (b)
is helpful in the regression model when X1 and X2 are included in the model; use alpha = .01.
State the alternatives, decision rule, and conclusion. Would the F* test statistics for the other
three potential predictor variables be as large as the one here? Discuss.
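For a single added variable Xk, the alternatives are H0: beta_k = 0 versus Ha: beta_k != 0, and the test statistic is F* = [SSR(Xk | X1, X2) / 1] / [SSE(X1, X2, Xk) / (n - 4)]. The decision rule is to conclude H0 if F* <= F(1 - alpha; 1, n - 4) and Ha otherwise. A minimal sketch of this partial F test is given below. It again uses mtcars in place of the CDI data, so the printed numbers are for illustration only.

```r
alpha <- 0.01

# Reduced and full models (stand-ins for the (X1, X2) and (X1, X2, Xk) fits)
reduced <- lm(mpg ~ wt + hp, data = mtcars)
full <- lm(mpg ~ wt + hp + qsec, data = mtcars)

sse_reduced <- deviance(reduced)
sse_full <- deviance(full)
df_full <- df.residual(full)              # n - 4 for three predictors
df_num <- df.residual(reduced) - df_full  # 1 when one variable is added

# Partial F statistic and the critical value at level alpha
f_star <- ((sse_reduced - sse_full) / df_num) / (sse_full / df_full)
f_crit <- qf(1 - alpha, df_num, df_full)
reject_h0 <- f_star > f_crit

# anova() on the two nested fits reports the same F statistic
stopifnot(isTRUE(all.equal(f_star, anova(reduced, full)$F[2])))

c(f_star = f_star, f_crit = f_crit)
reject_h0
```

Each candidate's F* is SSR(Xk | X1, X2) divided by MSE(X1, X2, Xk). A variable with a smaller extra sum of squares therefore also has a larger SSE(X1, X2, Xk) and a smaller F*, so the F* values for the other three variables cannot be as large as the one for the best variable.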