Lab-02

.pdf
Lab 02: Linear Regression Saptarshi Chakraborty 2023-01-29 Libraries The library() function is used to load libraries , or groups of functions and data sets that are not included in the base R distribution. Basic functions that perform least squares linear regression and other simple analyses come standard with the base distribution, but more exotic functions require additional libraries. Here we load the MASS package, which is a very large collection of data sets and functions. We also load the ISLR2 package, which includes the data sets associated with this book. library(MASS) library(ISLR2) ## ## Attaching package: 'ISLR2' ## The following object is masked from 'package:MASS': ## ## Boston If you receive an error message when loading any of these libraries, it likely indicates that the corresponding library has not yet been installed on your system. Some libraries, such as MASS , come with R and do not need to be separately installed on your computer. However, other packages, such as ISLR2 , must be downloaded the first time they are used. This can be done directly from within R . For example, on a Windows system, select the Install package option under the Packages tab. After you select any mirror site, a list of available packages will appear. Simply select the package you wish to install and R will automatically download the package. Alternatively, this can be done at the R command line via install.packages("ISLR2") . This installation only needs to be done the first time you use a package. However, the library() function must be called within each R session. Simple Linear Regression The ISLR2 library contains the Boston data set, which records medv (median house value) for 506 census tracts in Boston. We will seek to predict medv using 12 predictors such as rmvar (average number of rooms per house), age (average age of houses), and lstat (percent of households with low socioeconomic status). head(Boston) ## crim zn indus chas nox rm age dis rad tax ptratio lstat medv ## 1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 4.98 24.0 ## 2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 9.14 21.6 1
## 3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 4.03 34.7 ## 4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 2.94 33.4 ## 5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 5.33 36.2 ## 6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 5.21 28.7 To find out more about the data set, we can type ?Boston . We will start by using the lm() function to fit a simple linear regression model, with medv as the response and lstat as the predictor. The basic syntax is lm(y ~ x, data) , where y is the response, x is the predictor, and data is the data set in which these two variables are kept. # lm.fit <- lm(medv ~ lstat) The command causes an error because R does not know where to find the variables medv and lstat . The next line tells R that the variables are in Boston . If we attach Boston , the first line works fine because R now recognizes the variables. lm.fit <- lm(medv ~ lstat, data = Boston) attach(Boston) lm.fit <- lm(medv ~ lstat) If we type lm.fit , some basic information about the model is output. For more detailed information, we use summary(lm.fit) . This gives us p -values and standard errors for the coefficients, as well as the R 2 statistic and F -statistic for the model. summary(lm.fit) ## ## Call: ## lm(formula = medv ~ lstat) ## ## Residuals: ## Min 1Q Median 3Q Max ## -15.168 -3.990 -1.318 2.034 24.500 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 34.55384 0.56263 61.41 <2e-16 *** ## lstat -0.95005 0.03873 -24.53 <2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 6.216 on 504 degrees of freedom ## Multiple R-squared: 0.5441, Adjusted R-squared: 0.5432 ## F-statistic: 601.6 on 1 and 504 DF, p-value: < 2.2e-16 We can use the names() function in order to find out what other pieces of information are stored in lm.fit . Although we can extract these quantities by name—e.g. lm.fit$coefficients —it is safer to use the ex- tractor functions like coef() to access them. names(lm.fit) 2
## [1] "coefficients" "residuals" "effects" "rank" ## [5] "fitted.values" "assign" "qr" "df.residual" ## [9] "xlevels" "call" "terms" "model" coef(lm.fit) ## (Intercept) lstat ## 34.5538409 -0.9500494 Confidence Intervals In order to obtain a confidence interval for the coefficient estimates, we can use the confint() command. %Type confint(lm.fit) at the command line to obtain the confidence intervals. confint(lm.fit, level = 0.95 ) ## 2.5 % 97.5 % ## (Intercept) 33.448457 35.6592247 ## lstat -1.026148 -0.8739505 The predict() function can be used to produce confidence intervals and prediction intervals for the predic- tion of medv for a given value of lstat . predict(lm.fit, data.frame( lstat = (c( 5 , 10 , 15 )))) ## 1 2 3 ## 29.80359 25.05335 20.30310 plot(lstat, medv) abline(lm.fit) 10 20 30 10 20 30 40 50 lstat medv 3
Uploaded by ChiefCrownGrouse38 on coursehero.com