Questions
Conceptual Questions
1. (ISLR2, Q2.1) For each of parts (a) through (d), indicate whether we would generally expect the
performance of a flexible statistical learning method to be better or worse than an inflexible method.
Justify your answer.
a. The sample size $n$ is extremely large, and the number of predictors $p$ is small.
b. The number of predictors $p$ is extremely large, and the number of observations $n$ is small.
c. The relationship between the predictors and response is highly nonlinear.
d. The variance of the error terms, i.e., $\sigma^2 = \mathbb{V}(\epsilon)$, is extremely high.
Solution
2. (ISLR2, Q2.2) Explain whether each scenario is a classification or regression problem, and indicate
whether we are most interested in inference or prediction. Finally, provide $n$ and $p$.
a. We collect a set of data on the top 500 firms in the US. For each firm we record profit, number
of employees, industry and the CEO salary. We are interested in understanding which factors
affect CEO salary.
b. We are considering launching a new product and wish to know whether it will be a success or a
failure. We collect data on 20 similar products that were previously launched. For each product
we have recorded whether it was a success or failure, price charged for the product,
marketing budget, competition price, and ten other variables.
c. We are interested in predicting the % change in the USD/Euro exchange rate in relation to the
weekly changes in the world stock markets. Hence we collect weekly data for all of 2012. For
each week we record the % change in the USD/Euro, the % change in the US market, the %
change in the British market, and the % change in the German market.
Solution
3. (ISLR2, Q2.3) We now revisit the bias-variance decomposition.
a. Provide a sketch of typical (squared) bias, variance, training error, test error, and Bayes (or
irreducible) error curves, on a single plot, as we go from less flexible statistical learning methods
towards more flexible approaches. The $x$-axis should represent the amount of flexibility in the
method, and the $y$-axis should represent the values for each curve. There should be five curves.
Make sure to label each one.
b. Explain why each of the five curves has the shape displayed in part (a).
Solution
4. (ISLR2, Q2.5) What are the advantages and disadvantages of a very flexible (versus a less flexible)
approach for regression or classification? Under what circumstances might a more flexible approach
be preferred to a less flexible approach? When might a less flexible approach be preferred?
Solution
5. (ISLR2, Q2.7) The table below provides a training data set containing six observations, three
predictors, and one qualitative response variable.
Obs.   X1   X2   X3   Y
1       0    3    0   Red
2       2    0    0   Red
3       0    1    3   Red
4       0    1    2   Green
5      -1    0    1   Green
6       1    1    1   Red
Suppose we wish to use this data set to make a prediction for $Y$ when $X_1 = X_2 = X_3 = 0$ using
$K$-nearest neighbors.
a. Compute the Euclidean distance between each observation and the test point, $X_1 = X_2 = X_3 = 0$.
b. What is our prediction with $K = 1$? Why?
c. What is our prediction with $K = 3$? Why?
d. If the Bayes decision boundary in this problem is highly nonlinear, then would we expect the
best value for $K$ to be large or small? Why?
Solution
Applied Questions
1. (ISLR2, Q2.8) This exercise relates to the `College` data set, which can be found in the file
`College.csv` on the book website. It contains a number of variables for 777 different universities
and colleges in the US. The variables are
`Private`: Public/private indicator
`Apps`: Number of applications received
`Accept`: Number of applicants accepted
`Enroll`: Number of new students enrolled
`Top10perc`: New students from top 10% of high school class
`Top25perc`: New students from top 25% of high school class
`F.Undergrad`: Number of full-time undergraduates
`P.Undergrad`: Number of part-time undergraduates
`Outstate`: Out-of-state tuition
`Room.Board`: Room and board costs
`Books`: Estimated book costs
`Personal`: Estimated personal spending
`PhD`: Percent of faculty with Ph.D.'s
`Terminal`: Percent of faculty with terminal degree
`S.F.Ratio`: Student/faculty ratio
`perc.alumni`: Percent of alumni who donate
`Expend`: Instructional expenditure per student
`Grad.Rate`: Graduation rate
Before reading the data into R, it can be viewed in Excel or a text editor.
a. Use the `read.csv()` function to read the data into R. Call the loaded data `college`. Make sure
that you have the directory set to the correct location for the data.
b. Look at the data using the `View()` function. You should notice that the first column is just the
name of each university. We don't really want R to treat this as data. However, it may be handy
to have these names for later. Try the following commands:
    rownames(college) <- college[, 1]
    View(college)
You should see that there is now a `row.names` column with the name of each university
recorded. This means that R has given each row a name corresponding to the appropriate
university. R will not try to perform calculations on the row names. However, we still need to
eliminate the first column in the data where the names are stored. Try
    college <- college[, -1]
    View(college)
Now you should see that the first data column is `Private`. Note that another column labeled
`row.names` now appears before the `Private` column. However, this is not a data column but
rather the name that R is giving to each row.
c. i. Use the `summary()` function to produce a numerical summary of the variables in the data
set.
ii. Use the `pairs()` function to produce a scatterplot matrix of the first ten columns or
variables of the data. Recall that you can reference the first ten columns of a matrix `A`
using `A[,1:10]`.
iii. Use the `plot()` function to produce side-by-side boxplots of `Outstate` versus `Private`.
d. Create a new qualitative variable, called `Elite`, by binning the `Top10perc` variable. We are
going to divide universities into two groups based on whether or not the proportion of students
coming from the top 10% of their high school classes exceeds 50%.
    Elite <- rep("No", nrow(college))
    Elite[college$Top10perc > 50] <- "Yes"
    Elite <- as.factor(Elite)
    college <- data.frame(college, Elite)
Use the `summary()` function to see how many elite universities there are. Now use the `plot()`
function to produce side-by-side boxplots of `Outstate` versus `Elite`.
e. Use the `hist()` function to produce some histograms with differing numbers of bins for a few
of the quantitative variables. You may find the command `par(mfrow = c(2, 2))` useful: it will
divide the print window into four regions so that four plots can be made simultaneously.
Modifying the arguments to this function will divide the screen in other ways.
f. Continue exploring the data, and provide a brief summary of what you discover.
Solution
2. (ISLR2, Q2.9) This exercise involves the `Auto` data set studied in the lab. Make sure that the missing
values have been removed from the data.
a. Which of the predictors are quantitative, and which are qualitative?
b. What is the range of each quantitative predictor? You can answer this using the `range()`
function.
c. What is the mean and standard deviation of each quantitative predictor?
d. Now remove the 10th through 85th observations. What is the range, mean, and standard
deviation of each predictor in the subset of the data that remains?
e. Using the full data set, investigate the predictors graphically, using scatterplots or other tools of
your choice. Create some plots highlighting the relationships among the predictors. Comment
on your findings.
f. Suppose that we wish to predict gas mileage (`mpg`) on the basis of the other variables. Do your
plots suggest that any of the other variables might be useful in predicting `mpg`? Justify your
answer.
Solution
3. (ISLR2, Q2.10) This exercise involves the `Boston` housing data set.
a. To begin, load in the `Boston` data set. The `Boston` data set is part of the `ISLR2` library.
    library(ISLR2)
Now the data set is contained in the object `Boston`.
    Boston
Read about the data set:
    ?Boston
How many rows are in this data set? How many columns? What do the rows and columns
represent?
b. Make some pairwise scatterplots of the predictors (columns) in this data set. Describe your
findings.
c. Are any of the predictors associated with per capita crime rate? If so, explain the relationship.
d. Do any of the census tracts of Boston appear to have particularly high crime rates? Tax rates?
Pupil-teacher ratios? Comment on the range of each predictor.
e. How many of the census tracts in this data set bound the Charles river?
f. What is the median pupil-teacher ratio among the towns in this data set?
g. Which census tract of Boston has lowest median value of owner-occupied homes? What are the
values of the other predictors for that census tract, and how do those values compare to the
overall ranges for those predictors? Comment on your findings.
h. In this data set, how many of the census tracts average more than seven rooms per dwelling?
More than eight rooms per dwelling? Comment on the census tracts that average more than
eight rooms per dwelling.
Solution
Solutions
Conceptual Questions
1. a. Better: flexible models are better able to capture all the trends in the large amount of data we
have.
b. Worse: flexible models will tend to overfit the small amount of data we have using the large
number of predictors.
c. Better: inflexible models tend to have a hard time fitting nonlinear relationships.
d. Worse: flexible models will tend to fit the noise, which is not desired.
2. a. Regression: the response (CEO salary) is continuous. Inference: we are interested in the factors
influencing CEO salary; we don't want to estimate it using a company's information!
$n = 500$ (500 companies in the data set), $p = 3$ (predictors: profit, number of employees, industry;
response: CEO salary).
b. Classification: the response (success or failure) is discrete. Prediction: based on various input
factors, we want to estimate how well the product will do.
$n = 20$ (20 similar products in the data set), $p = 13$ (predictors: marketing budget, price charged,
competition price, +10 others; response: whether it was a success or failure).
c. Regression: the response (% change in the US dollar) is continuous. Prediction: it's written in the
question! We are interested in predicting changes in the US dollar.
$n \approx 50$ (number of trading weeks in a year), $p = 3$ (predictors: % change in US market,
% change in UK market, % change in DE market; response: % change in the US dollar).
3. a. See the figure.
b. Bias: decreases, since increasing flexibility reduces model bias.
Training error: decreases, since increasing flexibility makes the model fit the training data better.
Variance: increases, since increasing flexibility makes the model incorporate more noise (in other
words, it makes the fit bumpier).
Test error: U-shaped (concave up), since increasing flexibility makes the model fit more of the trend
in the data until it starts fitting the noise in the data.
Bayes (irreducible) error: horizontal line, since it is a constant for all models. The expected test
error cannot fall below it. When the training error dips below the Bayes error, the model is typically
overfitting, and the test error starts to increase.
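For reference, the shapes described in part (b) follow from the decomposition of the expected test MSE at a point $x_0$ given in Section 2.2.2 of ISLR2. The symbols $\hat{f}$, $x_0$ and $y_0$ are the textbook's notation rather than this lab's, and $\mathbb{V}$ denotes variance as in Question 1:

```latex
\mathbb{E}\left[\left(y_0 - \hat{f}(x_0)\right)^2\right]
  = \mathbb{V}\left(\hat{f}(x_0)\right)
  + \left[\operatorname{Bias}\left(\hat{f}(x_0)\right)\right]^2
  + \mathbb{V}(\epsilon)
```

The first two terms change with flexibility (variance up, squared bias down), while $\mathbb{V}(\epsilon)$ is the irreducible error and sets the floor for the expected test error.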
4. Advantages: can fit a larger variety of (nonlinear) models, decreasing bias.
Disadvantages: can lead to overfitting (hence worse results), requires estimating more parameters,
and increases model variance as it incorporates more noise.
More flexible models would be preferred if the underlying relationship is nonlinear in nature, or if
interpretability is not a major issue. Less flexible models are preferred when inference is the goal
of the model-fitting exercise.
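A small simulation can illustrate the trade-off discussed in Questions 3 and 4. This is only a sketch under assumptions that are not part of the lab: the data-generating function, noise level, sample sizes and the use of polynomial degree as the measure of flexibility are all chosen here for illustration.

```r
set.seed(1)  # arbitrary seed, for reproducibility only

# Assumed data-generating process (not from the lab): y = sin(2*pi*x) + noise.
f <- function(x) sin(2 * pi * x)
n_train <- 50
n_test  <- 1000
train <- data.frame(x = runif(n_train))
train$y <- f(train$x) + rnorm(n_train, sd = 0.3)
test <- data.frame(x = runif(n_test))
test$y <- f(test$x) + rnorm(n_test, sd = 0.3)

# Flexibility is measured here by the polynomial degree.
degrees <- 1:10
mse <- t(sapply(degrees, function(d) {
  fit <- lm(y ~ poly(x, d), data = train)
  c(train = mean((train$y - fitted(fit))^2),
    test  = mean((test$y - predict(fit, newdata = test))^2))
}))
rownames(mse) <- degrees

# One row per degree: training MSE and test MSE.
round(mse, 3)
```

Because the polynomial models are nested, the training MSE can only go down as the degree increases, whereas the test MSE is not guaranteed to do so.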
5. a. See the table below.
Obs.   X1   X2   X3   Distance                     Y
1       0    3    0   3                            Red
2       2    0    0   2                            Red
3       0    1    3   $\sqrt{10} \approx 3.2$      Red
4       0    1    2   $\sqrt{5} \approx 2.2$       Green
5      -1    0    1   $\sqrt{2} \approx 1.4$       Green
6       1    1    1   $\sqrt{3} \approx 1.7$       Red
b. Green. $K = 1$, so we only take the closest observation (5).
c. $K = 3$, so we consider the closest 3: observations 5, 6 and 2. The majority are Red, so this
classifies as Red.
d. A smaller $K$ would lead to a more flexible decision boundary, which would account for the
non-linearity better.
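The distances and both predictions in Question 5 can be checked with a few lines of R. This is a minimal sketch, not the lab's own code: `knn_predict` is a hypothetical helper written for this example, and $X_1$ for observation 5 is taken as $-1$ as in the textbook (its sign does not affect the distance to the origin).

```r
# Training data from Question 5 (ISLR2, Q2.7).
train <- data.frame(
  X1 = c(0, 2, 0, 0, -1, 1),
  X2 = c(3, 0, 1, 1, 0, 1),
  X3 = c(0, 0, 3, 2, 1, 1),
  Y  = c("Red", "Red", "Red", "Green", "Green", "Red")
)
test_point <- c(X1 = 0, X2 = 0, X3 = 0)

# Euclidean distance from every training observation to the test point.
X <- as.matrix(train[, c("X1", "X2", "X3")])
dists <- sqrt(rowSums(sweep(X, 2, test_point)^2))

# Hypothetical helper: majority vote among the K nearest observations.
# Ties are broken by which.max(), i.e. in favour of the first label.
knn_predict <- function(dists, labels, K) {
  nearest <- order(dists)[seq_len(K)]
  votes <- table(labels[nearest])
  names(votes)[which.max(votes)]
}

round(dists, 1)
knn_predict(dists, train$Y, K = 1)  # nearest neighbour is observation 5
knn_predict(dists, train$Y, K = 3)  # neighbours are observations 5, 6 and 2
```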
Applied Questions
Refer to Section 2.3 of ISLR2 for a primer on the applied questions.
1. a. Note that the data set is available from the course Moodle site.
b. The row names need to be changed to the college names as follows:
                             Private Apps Accept Enroll Top10perc Top25perc
Abilene Christian University     Yes 1660   1232    721        23        52
Adelphi University               Yes 2186   1924    512        16        29
Adrian College                   Yes 1428   1097    336        22        50
Agnes Scott College              Yes  417    349    137        60        89
Alaska Pacific University        Yes  193    146     55        16        44
Albertson College                Yes  587    479    158        38        62
                             F.Undergrad P.Undergrad Outstate Room.Board Books
Abilene Christian University        2885         537     7440       3300   450
Adelphi University                  2683        1227    12280       6450   750
Adrian College                      1036          99    11250       3750   400
Agnes Scott College                  510          63    12960       5450   450
Alaska Pacific University            249         869     7560       4120   800
Albertson College                    678          41    13500       3335   500
                             Personal PhD Terminal S.F.Ratio perc.alumni Expend
Abilene Christian University     2200  70       78      18.1          12   7041
Adelphi University               1500  29       30      12.2          16  10527
Adrian College                   1165  53       66      12.9          30   8735
Agnes Scott College               875  92       97       7.7          37  19016
Alaska Pacific University        1500  76       72      11.9           2  10922
Albertson College                 675  67       73       9.4          11   9727
                             Grad.Rate
Abilene Christian University        60
Adelphi University                  56
Adrian College                      54
Agnes Scott College                 59
Alaska Pacific University           15
Albertson College                   55
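The row-name step behind the output above can be sketched on a toy data frame. The column name `X` and all values below are made up for illustration; they are not taken from `College.csv`.

```r
# Toy data frame standing in for the first few columns of College.csv
# (values are made up).
toy <- data.frame(
  X = c("College A", "College B", "College C"),
  Private = c("Yes", "No", "Yes"),
  Apps = c(1200, 5400, 800)
)

# Use the first column as row names, then drop it from the data.
rownames(toy) <- toy[, 1]
toy <- toy[, -1]

toy
toy["College B", "Apps"]  # rows can now be looked up by name
```

Keeping the names as row labels rather than as a column means functions such as `summary()` do not treat them as a variable.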
c. i.
 Private        Apps           Accept          Enroll       Top10perc
 No :212   Min.   :   81   Min.   :   72   Min.   :  35   Min.   : 1.00
 Yes:565   1st Qu.:  776   1st Qu.:  604   1st Qu.: 242   1st Qu.:15.00
           Median : 1558   Median : 1110   Median : 434   Median :23.00
           Mean   : 3002   Mean   : 2019   Mean   : 780   Mean   :27.56
           3rd Qu.: 3624   3rd Qu.: 2424   3rd Qu.: 902   3rd Qu.:35.00
           Max.   :48094   Max.   :26330   Max.   :6392   Max.   :96.00
   Top25perc      F.Undergrad     P.Undergrad         Outstate
 Min.   :  9.0   Min.   :  139   Min.   :    1.0   Min.   : 2340
 1st Qu.: 41.0   1st Qu.:  992   1st Qu.:   95.0   1st Qu.: 7320
 Median : 54.0   Median : 1707   Median :  353.0   Median : 9990
 Mean   : 55.8   Mean   : 3700   Mean   :  855.3   Mean   :10441
 3rd Qu.: 69.0   3rd Qu.: 4005   3rd Qu.:  967.0   3rd Qu.:12925
 Max.   :100.0   Max.   :31643   Max.   :21836.0   Max.   :21700
   Room.Board       Books           Personal         PhD
 Min.   :1780   Min.   :  96.0   Min.   : 250   Min.   :  8.00
 1st Qu.:3597   1st Qu.: 470.0   1st Qu.: 850   1st Qu.: 62.00
 Median :4200   Median : 500.0   Median :1200   Median : 75.00
 Mean   :4358   Mean   : 549.4   Mean   :1341   Mean   : 72.66
 3rd Qu.:5050   3rd Qu.: 600.0   3rd Qu.:1700   3rd Qu.: 85.00
 Max.   :8124   Max.   :2340.0   Max.   :6800   Max.   :103.00
    Terminal       S.F.Ratio      perc.alumni        Expend
 Min.   : 24.0   Min.   : 2.50   Min.   : 0.00   Min.   : 3186
 1st Qu.: 71.0   1st Qu.:11.50   1st Qu.:13.00   1st Qu.: 6751
 Median : 82.0   Median :13.60   Median :21.00   Median : 8377
 Mean   : 79.7   Mean   :14.09   Mean   :22.74   Mean   : 9660
 3rd Qu.: 92.0   3rd Qu.:16.50   3rd Qu.:31.00   3rd Qu.:10830
 Max.   :100.0   Max.   :39.80   Max.   :64.00   Max.   :56233
   Grad.Rate
 Min.   : 10.00
 1st Qu.: 53.00
 Median : 65.00
 Mean   : 65.46
 3rd Qu.: 78.00
 Max.   :118.00
ii. `Private` is not numerical, so it cannot be used in `pairs()`; we plot from column 2 onward.
iv.
 No Yes
699  78
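The binning behind the counts above can be sketched on made-up values. The vector `top10perc` below is hypothetical; the two approaches mirror the `rep()`-then-overwrite commands given in the question and the `ifelse()` call used in the solution code.

```r
# Made-up Top10perc values for six hypothetical universities.
top10perc <- c(23, 16, 60, 89, 50, 51)

# Approach from the question: start from "No" and overwrite where the
# condition holds.
elite_a <- rep("No", length(top10perc))
elite_a[top10perc > 50] <- "Yes"
elite_a <- as.factor(elite_a)

# Approach from the solution: vectorised ifelse().
elite_b <- as.factor(ifelse(top10perc > 50, "Yes", "No"))

identical(elite_a, elite_b)
# Counts per level; a value of exactly 50 is "No" because the comparison
# is strict.
summary(elite_b)
```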
v. For instance, calling `hist()` on each variable after `par(mfrow = c(2, 2))` gives histograms for
the variables `Apps`, `Accept`, `Top10perc` and `Top25perc`.
2. Note that the data set is available from the course Moodle site.
a. Qualitative: `name`, `origin`. Quantitative: `mpg`, `cylinders`, `displacement`, `horsepower`,
`weight`, `acceleration`, `year`.
b. We can use the function `apply` combined with the function `range`:
     mpg cylinders displacement horsepower weight acceleration year
min  9.0         3           68         46   1613          8.0   70
max 46.6         8          455        230   5140         24.8   82
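As a side note on the call that produced the table above: `apply()` converts a data frame to a matrix before working column by column, so it is only safe on purely numeric columns. A roughly equivalent sketch uses `sapply()`, which iterates over the columns directly. The toy values below are made up and merely stand in for two quantitative `Auto` columns.

```r
# Toy stand-in for two quantitative Auto columns (values are made up).
toy <- data.frame(mpg = c(18, 15, 36.1), weight = c(3504, 3693, 1800))

# Column-wise ranges via apply() (matrix coercion) and via sapply().
by_apply  <- apply(toy, 2, range)
by_sapply <- sapply(toy, range)
rownames(by_apply)  <- c("min", "max")
rownames(by_sapply) <- c("min", "max")

by_apply
all.equal(by_apply, by_sapply)
```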
c. Use a similar application of the function `apply`:
                    mean         sd.
mpg            23.445918    7.805007
cylinders       5.471939    1.705783
displacement  194.411990  104.644004
horsepower    104.469388   38.491160
weight       2977.584184  849.402560
acceleration   15.541327    2.758864
year           75.979592    3.683737
d. Semantic note: `subauto <- auto[-(10:85), ]` will remove the 10th to the 85th rows, which may not
be what we want, since we have already removed some rows to begin with.
You will find that observation #86 has errantly been removed. That is because the `na.omit`
from earlier removed an observation in this range. It is possible to refer to the rows by
observation number, which is a character string. In other words, `auto["5", ]` will give
observation #5, even if observations 1 to 4 are missing. This does complicate the procedure, though.
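The difference between indexing by position and indexing by observation number can be seen on a small made-up example (the data frame `toy` below is hypothetical and much smaller than `Auto`):

```r
# Toy data: observation 3 has a missing value (values are made up).
toy <- data.frame(x = c(10, 20, NA, 40, 50, 60))
toy <- na.omit(toy)
rownames(toy)  # "1" "2" "4" "5" "6": the original labels are kept

# Dropping rows 3 to 4 by position removes the observations labelled
# "4" and "5".
by_position <- toy[-(3:4), , drop = FALSE]

# Dropping observations 3 to 4 by label keeps "5".
rid <- rownames(toy)
rid <- rid[as.numeric(rid) < 3 | as.numeric(rid) > 4]
by_label <- toy[rid, , drop = FALSE]

rownames(by_position)  # "1" "2" "6"
rownames(by_label)     # "1" "2" "5" "6"
```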
Using the `apply` function as before:
                min     max        mean        sd.
mpg            11.0    46.6   24.368454   7.880898
cylinders       3.0     8.0    5.381703   1.658135
displacement   68.0   455.0  187.753943  99.939488
horsepower     46.0   230.0  100.955836  35.895567
weight       1649.0  4997.0 2939.643533 812.649629
acceleration    8.5    24.8   15.718297   2.693813
year           70.0    82.0   77.132492   3.110026
e. For instance a pairwise plot can be produced using:
ACTL3142 and ACTL5110
Lab 1: Introduction To Statistical Learning
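# R commands referenced in the Applied Questions
# Question 1(b): use the first column (university names) as row names, then drop it
rownames(college) <- college[, 1]
View(college)
college <- college[, -1]
View(college)
# Question 1(d): create the qualitative variable Elite by binning Top10perc
Elite <- rep("No", nrow(college))
Elite[college$Top10perc > 50] <- "Yes"
Elite <- as.factor(Elite)
college <- data.frame(college, Elite)
# Question 3(a): load the Boston data set and read its documentation
library(ISLR2)
Boston
?Boston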
# R code for the solutions to the Applied Questions
# Question 1(a)-(b): read the data and move the university names to the row names
college <- read.csv("College.csv")
rownames(college) <- college[, 1]
college <- college[, -1]
college$Private <- as.factor(college$Private)
head(college)  # You should instead try `View(college)`
# Question 1(c)(i): numerical summary of the variables
summary(college)
# Question 1(c)(ii): scatterplot matrix of the first ten columns
pairs(college[, 1:10])
# Question 1(c)(iii): side-by-side boxplots of Outstate versus Private
plot(college$Private, college$Outstate)
# Question 1(d): bin Top10perc into the qualitative variable Elite
Elite <- with(college, ifelse(Top10perc > 50, "Yes", "No"))
Elite <- as.factor(Elite)
college <- data.frame(college, Elite)
summary(Elite)  # there are 78 elite universities
plot(college$Elite, college$Outstate)
# Question 1(e): four histograms in a 2 x 2 grid
par(mfrow = c(2, 2))
hist(college$Apps, breaks = 20)
hist(college$Accept, breaks = 20)
hist(college$Top10perc, breaks = 20)
hist(college$Top25perc, breaks = 20)
# Question 2: read the Auto data, treating "?" as missing
auto <- read.csv("Auto.csv", na.strings = "?")
auto <- na.omit(auto)  # remove missing values
# Question 2(b): range of each quantitative predictor
quant.var <- c("mpg", "cylinders", "displacement", "horsepower",
               "weight", "acceleration", "year")
ranges.df <- apply(auto[, quant.var], 2, range)
rownames(ranges.df) <- c("min", "max")
ranges.df
# Question 2(c): mean and standard deviation of each quantitative predictor
means.df <- apply(auto[, quant.var], 2, mean)
std.df <- apply(auto[, quant.var], 2, sd)
distns.df <- rbind(means.df, std.df)
rownames(distns.df) <- c("mean", "sd.")
t(distns.df)
# Question 2(d): dropping rows 10 to 85 by position (also removes observation #86)
subauto <- auto[-(10:85), ]
# Question 2(d): dropping observations 10 to 85 by observation number instead
rid <- rownames(auto)
rid <- rid[as.numeric(rid) < 10 | as.numeric(rid) > 85]
subauto <- auto[rid, ]
subranges.df <- apply(subauto[, quant.var], 2, range)
submeans.df <- apply(subauto[, quant.var], 2, mean)
substd.df <- apply(subauto[, quant.var], 2, sd)
subdistns.df <- rbind(subranges.df, submeans.df, substd.df)
rownames(subdistns.df) <- c("min", "max", "mean", "sd.")
t(subdistns.df)
# Question 2(e): pairwise plot of all columns except the 9th (name)
pairs(auto[, -9])