School: University of California, Berkeley
Course: STAT 154
Subject: Statistics

Lab 02: Linear Regression
Saptarshi Chakraborty
2023-01-29
Libraries
The library() function is used to load libraries, or groups of functions and data sets that are not included in the base R distribution. Basic functions that perform least squares linear regression and other simple analyses come standard with the base distribution, but more exotic functions require additional libraries. Here we load the MASS package, which is a very large collection of data sets and functions. We also load the ISLR2 package, which includes the data sets associated with the textbook An Introduction to Statistical Learning (second edition).
library(MASS)
library(ISLR2)
##
## Attaching package: 'ISLR2'
## The following object is masked from 'package:MASS':
##
##     Boston
If you receive an error message when loading any of these libraries, it likely indicates that the corresponding library has not yet been installed on your system. Some libraries, such as MASS, come with R and do not need to be separately installed on your computer. However, other packages, such as ISLR2, must be downloaded the first time they are used. This can be done directly from within R. For example, on a Windows system, select the Install package option under the Packages tab. After you select any mirror site, a list of available packages will appear. Simply select the package you wish to install and R will automatically download the package. Alternatively, this can be done at the R command line via install.packages("ISLR2"). This installation only needs to be done the first time you use a package. However, the library() function must be called within each R session.
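The install-once, load-every-session pattern described above can be sketched as a small check at the top of a script. This is only an illustration: the variable names pkgs, is_installed and not_installed are made up for this example, and the sketch deliberately reports missing packages rather than installing them.

```r
# Packages used in this lab (variable names here are illustrative).
pkgs <- c("MASS", "ISLR2")

# requireNamespace() reports whether a package can be loaded,
# without attaching it to the search path.
is_installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
not_installed <- pkgs[!is_installed]

# Report anything that still needs a one-time install.packages() call.
if (length(not_installed) > 0) {
  message("Not yet installed: ", paste(not_installed, collapse = ", "),
          ". Run install.packages() on these once.")
}

# Attach the packages that are available; this step is needed in every session.
for (p in pkgs[is_installed]) {
  library(p, character.only = TRUE)
}
```

The character.only = TRUE argument is needed because library() otherwise treats its argument as a literal package name rather than as a variable holding one.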
Simple Linear Regression
The ISLR2 library contains the Boston data set, which records medv (median house value) for 506 census tracts in Boston. We will seek to predict medv using 12 predictors such as rm (average number of rooms per house), age (average age of houses), and lstat (percent of households with low socioeconomic status).
head(Boston)
##      crim zn indus chas   nox    rm  age    dis rad tax ptratio lstat medv
## 1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3  4.98 24.0
## 2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8  9.14 21.6
## 3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8  4.03 34.7
## 4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7  2.94 33.4
## 5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7  5.33 36.2
## 6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7  5.21 28.7
To find out more about the data set, we can type ?Boston.
We will start by using the lm() function to fit a simple linear regression model, with medv as the response and lstat as the predictor. The basic syntax is lm(y ~ x, data), where y is the response, x is the predictor, and data is the data set in which these two variables are kept.
# lm.fit <- lm(medv ~ lstat)
The command above is commented out because it causes an error: R does not know where to find the variables medv and lstat. The first line below tells R that the variables are in Boston. Alternatively, if we attach Boston, the original command works because R now recognizes the variables.
lm.fit <- lm(medv ~ lstat, data = Boston)
attach(Boston)
lm.fit <- lm(medv ~ lstat)
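Because attach() adds the data frame to the search path for the rest of the session, it can be useful to know two ways of fitting the same model without it. The sketch below is an illustration rather than part of the lab: the names fit_data and fit_with are made up, and it assumes the copy of Boston that ships with MASS, which contains the same medv and lstat columns, so that it runs without ISLR2.

```r
library(MASS)  # assumption: MASS's Boston stands in for the ISLR2 copy

# Option 1: name the data frame explicitly through the data argument.
fit_data <- lm(medv ~ lstat, data = Boston)

# Option 2: evaluate the call inside the data frame using with().
fit_with <- with(Boston, lm(medv ~ lstat))

# Both routes estimate the same intercept and slope.
all.equal(coef(fit_data), coef(fit_with))
```

If Boston has already been attached, detach(Boston) removes it from the search path again.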
If we type lm.fit, some basic information about the model is output. For more detailed information, we use summary(lm.fit). This gives us p-values and standard errors for the coefficients, as well as the R² statistic and F-statistic for the model.
summary(lm.fit)
##
## Call:
## lm(formula = medv ~ lstat)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -15.168  -3.990  -1.318   2.034  24.500
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.55384    0.56263   61.41   <2e-16 ***
## lstat       -0.95005    0.03873  -24.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.216 on 504 degrees of freedom
## Multiple R-squared:  0.5441, Adjusted R-squared:  0.5432
## F-statistic: 601.6 on 1 and 504 DF,  p-value: < 2.2e-16
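The numbers printed by summary() can also be pulled out of the returned object for later use. The following is a minimal sketch, assuming the MASS copy of Boston (same medv and lstat columns) so that it runs without ISLR2; the name fit_summary is illustrative.

```r
library(MASS)  # assumption: MASS's Boston stands in for the ISLR2 copy
lm.fit <- lm(medv ~ lstat, data = Boston)

# Store the summary object instead of only printing it.
fit_summary <- summary(lm.fit)

coef(fit_summary)           # matrix: estimate, std. error, t value, p-value
fit_summary$r.squared       # multiple R-squared
fit_summary$adj.r.squared   # adjusted R-squared
fit_summary$sigma           # residual standard error
fit_summary$fstatistic      # F value with numerator and denominator df
```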
We can use the names() function in order to find out what other pieces of information are stored in lm.fit. Although we can extract these quantities by name (e.g. lm.fit$coefficients), it is safer to use the extractor functions like coef() to access them.
names(lm.fit)
##  [1] "coefficients"  "residuals"     "effects"       "rank"
##  [5] "fitted.values" "assign"        "qr"            "df.residual"
##  [9] "xlevels"       "call"          "terms"         "model"
coef(lm.fit)
## (Intercept)       lstat
##  34.5538409  -0.9500494
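Besides coef(), base R provides the extractor functions fitted() and residuals(). As a rough illustration of how these pieces relate, the sketch below rebuilds the fitted values from the coefficients and checks that fitted values plus residuals recover the observed response. It assumes the MASS copy of Boston so that it runs without ISLR2, and the names beta, fitted_vals, resid_vals and manual_fitted are made up for the example.

```r
library(MASS)  # assumption: MASS's Boston stands in for the ISLR2 copy
lm.fit <- lm(medv ~ lstat, data = Boston)

beta <- coef(lm.fit)              # intercept and slope
fitted_vals <- fitted(lm.fit)     # one fitted value per census tract
resid_vals <- residuals(lm.fit)   # observed minus fitted

# Fitted value = intercept + slope * lstat.
manual_fitted <- beta[["(Intercept)"]] + beta[["lstat"]] * Boston$lstat
all.equal(unname(fitted_vals), manual_fitted)

# Observed response = fitted value + residual.
all.equal(unname(fitted_vals + resid_vals), Boston$medv)
```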
Confidence Intervals
In order to obtain a confidence interval for the coefficient estimates, we can use the confint() command. Typing confint(lm.fit) at the command line gives the intervals at the default 95% level, which the call below states explicitly.
confint(lm.fit, level = 0.95)
##                 2.5 %     97.5 %
## (Intercept) 33.448457 35.6592247
## lstat       -1.026148 -0.8739505
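For a linear model, these intervals are the usual estimate plus or minus a t quantile times the standard error. The sketch below reproduces them by hand as a check on that reading. It assumes the MASS copy of Boston so that it runs without ISLR2, and the names est, se, t_crit and manual_ci are illustrative.

```r
library(MASS)  # assumption: MASS's Boston stands in for the ISLR2 copy
lm.fit <- lm(medv ~ lstat, data = Boston)

est <- coef(lm.fit)                            # coefficient estimates
se <- sqrt(diag(vcov(lm.fit)))                 # their standard errors
t_crit <- qt(0.975, df = df.residual(lm.fit))  # t quantile for a 95% interval

# Estimate +/- critical value * standard error.
manual_ci <- cbind(lower = est - t_crit * se,
                   upper = est + t_crit * se)
manual_ci

# Agrees with confint() up to numerical precision.
all.equal(unname(manual_ci), unname(confint(lm.fit, level = 0.95)))
```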
The predict() function can be used to produce confidence intervals and prediction intervals for the prediction of medv for a given value of lstat. Called without an interval argument, it returns only the point predictions.
predict(lm.fit, data.frame(lstat = (c(5, 10, 15))))
##        1        2        3
## 29.80359 25.05335 20.30310
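To get the confidence and prediction intervals mentioned above, predict() accepts an interval argument. The following is a minimal sketch, assuming the MASS copy of Boston so that it runs without ISLR2; the names new_lstat, conf_int and pred_int are illustrative.

```r
library(MASS)  # assumption: MASS's Boston stands in for the ISLR2 copy
lm.fit <- lm(medv ~ lstat, data = Boston)

# New predictor values at which to predict medv.
new_lstat <- data.frame(lstat = c(5, 10, 15))

# Interval for the mean response at each lstat value.
conf_int <- predict(lm.fit, new_lstat, interval = "confidence")

# Interval for an individual new observation at each lstat value.
pred_int <- predict(lm.fit, new_lstat, interval = "prediction")

conf_int   # columns: fit, lwr, upr
pred_int   # same fit column, wider lwr/upr bounds
```

Both intervals are centered at the same fitted value. The prediction interval is wider because it also accounts for the variability of an individual observation around the regression line.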
plot(lstat, medv)
abline(lm.fit)
[Figure: scatterplot of medv (y-axis) against lstat (x-axis) with the fitted least squares line overlaid.]