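Simple linear regression questions

1. Prove that the least squares coefficient estimates (LSE) $\hat{\beta}_0$ and $\hat{\beta}_1$ are:

$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{S_{xy}}{S_{xx}}$$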
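As a numerical companion to Question 1, the closed-form estimates $S_{xy}/S_{xx}$ and $\bar{y} - \hat{\beta}_1 \bar{x}$ can be evaluated directly. The sketch below is only an illustration: the function name `least_squares` and the five data points are hypothetical and do not come from the lab.

```python
def least_squares(x, y):
    """Return (beta0_hat, beta1_hat) for simple linear regression of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Corrected sums of products and squares
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    beta1 = s_xy / s_xx
    beta0 = y_bar - beta1 * x_bar
    return beta0, beta1


# Hypothetical toy data, chosen so the sums are easy to check by hand
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.1, 5.9, 8.2, 9.8]
beta0, beta1 = least_squares(x, y)
print(beta0, beta1)
```

Here $S_{xx} = 10$ and $S_{xy} = 19.7$, so the slope and intercept can be confirmed by hand before attempting the general proof.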
2. Prove that the estimates in Q1 are unbiased.

3. Prove that the maximum likelihood estimates (MLE) $\hat{\beta}_0$ and $\hat{\beta}_1$ are equal to the ones given by LSE (from Q1).

4. Prove that SST = SSE + SSM.

5. Express SSM in terms of a) $\hat{\beta}_1$ and b) $\hat{\beta}_1^2$.
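Before proving the identities in Questions 4 and 5, it can help to see them hold numerically. The following sketch fits a line to a small hypothetical data set (not taken from the lab) and computes the three sums of squares, assuming the usual definitions SST $= \sum (y_i - \bar{y})^2$, SSM $= \sum (\hat{y}_i - \bar{y})^2$ and SSE $= \sum (y_i - \hat{y}_i)^2$.

```python
# Hypothetical toy data (illustrative only)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.1, 5.9, 8.2, 9.8]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

# Least squares fit
beta1 = s_xy / s_xx
beta0 = y_bar - beta1 * x_bar
fitted = [beta0 + beta1 * xi for xi in x]

# Sums of squares
sst = sum((yi - y_bar) ** 2 for yi in y)                 # total
ssm = sum((fi - y_bar) ** 2 for fi in fitted)            # model (regression)
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))   # error (residual)

print(sst, sse + ssm)                        # the two should agree
print(ssm, beta1 * s_xy, beta1 ** 2 * s_xx)  # three expressions for SSM
```

A numerical check is not a proof, but it shows which quantities the algebra has to connect.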
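6. Prove the following variance formulas:

$$V(\hat{\beta}_0 \mid X = x) = \sigma^2 \left( \frac{1}{n} + \frac{\bar{x}^2}{S_{xx}} \right)$$

$$V(\hat{\beta}_1 \mid X = x) = \frac{\sigma^2}{S_{xx}}$$

$$\mathrm{Cov}(\hat{\beta}_0, \hat{\beta}_1 \mid X = x) = -\frac{\bar{x}\,\sigma^2}{S_{xx}}$$

7. Prove that

$$V(\hat{y}_0) = \left( \frac{1}{n} + \frac{(\bar{x} - x_0)^2}{S_{xx}} \right) \sigma^2,$$

where $\hat{y}_0 = \hat{\beta}_0 + \hat{\beta}_1 x_0$ is the estimate of $E(y \mid X = x_0)$.

8. Prove:

$$E[\,Y_i - \hat{y}_i \mid X = x, X = x_i\,] = 0$$

$$V(Y_i - \hat{y}_i \mid X = x, X = x_i) = \sigma^2 \left( 1 + \frac{1}{n} + \frac{(\bar{x} - x_i)^2}{S_{xx}} \right)$$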
9. Forensic scientists use various methods for determining the likely time of death from postmortem examination of human bodies. A recently suggested objective method uses the concentration of a compound (3-methoxytyramine, or 3-MT) in a particular part of the brain. In a study of the relationship between postmortem interval and the concentration of 3-MT, samples of the appropriate part of the brain were taken from coroners' cases for which the time of death had been determined from eyewitness accounts. The intervals ($x$; in hours) and concentrations ($y$; in parts per million) for 18 individuals who were found to have died from organic heart disease are given in the following table. For the last two individuals (numbered 17 and 18 in the table) there was no eyewitness testimony directly available, and the time of death was established on the basis of other evidence, including knowledge of the individuals' activities.

Observation number   Interval (x)   Concentration (y)
 1                    5.5            3.26
 2                    6.0            2.67
 3                    6.5            2.82
 4                    7.0            2.80
 5                    8.0            3.29
 6                   12.0            2.28
 7                   12.0            2.34
 8                   14.0            2.18
 9                   15.0            1.97
10                   15.5            2.56
11                   17.5            2.09
12                   17.5            2.69
13                   20.0            2.56
14                   21.0            3.17
15                   25.5            2.18
16                   26.0            1.94
17                   48.0            1.57
18                   60.0            0.61

$\sum x = 337$, $\sum x^2 = 9854.5$, $\sum y = 42.98$, $\sum y^2 = 109.7936$, $\sum xy = 672.8$

In this investigation you are required to explore the relationship between concentration (regarded as the response/dependent variable) and interval (regarded as the explanatory/independent variable).

a. Construct a scatterplot of the data. Comment on any interesting features of the data and discuss briefly whether linear regression is appropriate to model the relationship between concentration of 3-MT and the interval from death.
b. Calculate the correlation coefficient for the data, and use it to test the null hypothesis that the population correlation coefficient is equal to zero.
c. Calculate the equation of the least-squares fitted regression line and use it to estimate the concentration of 3-MT:
   i. after 1 day and
   ii. after 2 days.
   Comment briefly on the reliability of these estimates.
d. Calculate a 99% confidence interval for the slope of the regression line. Using this confidence interval, test the hypothesis that the slope of the regression line is equal to zero. Comment on your answer in relation to the answer given in part (b) above.
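Question 9 supplies raw sums rather than corrected sums, so a first step is to convert them using $S_{xx} = \sum x^2 - (\sum x)^2 / n$ (and similarly for $S_{yy}$ and $S_{xy}$). The sketch below does this for the summary figures quoted in the question and forms the usual statistic $t = r\sqrt{n-2}/\sqrt{1-r^2}$, which is compared with a $t_{n-2}$ distribution under the null hypothesis of zero correlation. The helper names are hypothetical, and the critical value still has to be looked up in tables.

```python
import math


def corrected_sums(n, sum_x, sum_y, sum_x2, sum_y2, sum_xy):
    """Convert raw sums into S_xx, S_yy and S_xy."""
    s_xx = sum_x2 - sum_x ** 2 / n
    s_yy = sum_y2 - sum_y ** 2 / n
    s_xy = sum_xy - sum_x * sum_y / n
    return s_xx, s_yy, s_xy


def correlation_t_statistic(r, n):
    """t statistic for H0: rho = 0, referred to t with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)


# Summary sums as quoted in Question 9 (n = 18 observations)
n = 18
s_xx, s_yy, s_xy = corrected_sums(n, 337, 42.98, 9854.5, 109.7936, 672.8)
r = s_xy / math.sqrt(s_xx * s_yy)
t_stat = correlation_t_statistic(r, n)
print(s_xx, s_yy, s_xy, r, t_stat)
```

The same `corrected_sums` helper applies to any question in this lab that quotes $\sum x$, $\sum x^2$, $\sum y$, $\sum y^2$ and $\sum xy$.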
10. A university wishes to analyse the performance of its students on a particular degree course. It records the scores obtained by a sample of 12 students at entry to the course, and the scores obtained in their final examinations by the same students. The results are as follows:

Student                        A   B   C   D   E   F   G   H   I   J   K   L
Entrance exam score x (%)      86  53  71  60  62  79  66  84  90  55  58  72
Final paper score y (%)        75  60  74  68  70  75  78  90  85  60  62  70

$\sum x = 836$, $\sum y = 867$, $\sum x^2 = 60{,}016$, $\sum y^2 = 63{,}603$, $\sum (x - \bar{x})(y - \bar{y}) = 1{,}122$

a. Calculate the fitted linear regression equation of $y$ on $x$.
b. Assuming the full normal model, calculate an estimate of the error variance $\sigma^2$ and obtain a 90% confidence interval for $\sigma^2$.
c. By considering the slope parameter, formally test whether the data are positively correlated.
d. Find a 95% confidence interval for the mean finals paper score corresponding to an individual entrance score of 53.
e. Test whether these data come from a population with a correlation coefficient equal to 0.75.
f. Calculate the proportion of variance explained by the model. Hence, comment on the fit of the model.
11. Complete the following ANOVA table for a simple linear regression with $n = 60$ observations:

Source       D.F.   Sum of Squares   Mean Squares   F-Ratio
Regression   ____   ____             ____           ____
Error        ____   ____             8.2
Total        ____   639.5
12. Suppose you are interested in relating the accounting variable EPS (earnings per share) to the market variable STKPRICE (stock price). A regression equation was fitted using STKPRICE as the response variable with EPS as the regressor variable. The computer output from your fitted regression (the Regression Analysis listing for STKPRICE on EPS) appears later in this document. You are also given that:

$\bar{x} = 2.338$, $\bar{y} = 40.21$, $s_x = 2.004$, and $s_y = 21.56$.

(Note that $s_x^2 = \dfrac{S_{xx}}{n-1}$ and $s_y^2 = \dfrac{S_{yy}}{n-1}$.)

a. Calculate the correlation coefficient of EPS and STKPRICE.
b. Estimate the STKPRICE given an EPS of $2. Provide a 95% confidence interval of your estimate.
c. Provide a 95% confidence interval for the slope coefficient $\beta$.
d. Compute $s$ and $R^2$.
e. Describe how you would check if the errors have constant variance.
f. Perform a test of the significance of EPS in predicting STKPRICE at a level of significance of 5%.
g. Test the hypothesis $H_0: \beta = 24$ against $H_a: \beta > 24$ at a level of significance of 5%.
13. (Modified from an Institute of Actuaries exam problem) An insurance company issues house buildings policies for houses of similar size in four different postcode regions $A$, $B$, $C$, and $D$. An insurance agent takes independent random samples of 10 house buildings policies for houses of similar size in each of the four regions. The annual premiums (in dollars) were as follows:

Region A: 229 241 270 256 241 247 261 243 272 219   ($\sum x = 2{,}479$, $\sum x^2 = 617{,}163$)
Region B: 261 269 284 268 249 255 237 270 269 257   ($\sum x = 2{,}619$, $\sum x^2 = 687{,}467$)
Region C: 253 247 244 245 221 229 245 256 232 269   ($\sum x = 2{,}441$, $\sum x^2 = 597{,}607$)
Region D: 279 268 290 245 281 262 287 257 262 246   ($\sum x = 2{,}677$, $\sum x^2 = 718{,}973$)

Perform a one-way analysis of variance at the 5% level to compare the premiums for all four regions. State briefly the assumptions required to perform this analysis of variance.
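The one-way analysis of variance in Question 13 partitions the total sum of squares into a between-groups part and a within-groups part, then compares their mean squares. The sketch below shows that calculation on three small hypothetical groups (not the premium data); the function name `one_way_anova` is an assumption made for illustration.

```python
def one_way_anova(groups):
    """Return (SS between, SS within, F ratio) for a list of samples."""
    k = len(groups)                        # number of groups
    n = sum(len(g) for g in groups)        # total number of observations
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]

    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    ss_within = sum((v - m) ** 2
                    for g, m in zip(groups, means) for v in g)

    ms_between = ss_between / (k - 1)      # k - 1 degrees of freedom
    ms_within = ss_within / (n - k)        # n - k degrees of freedom
    return ss_between, ss_within, ms_between / ms_within


# Hypothetical toy groups, small enough to verify by hand
toy_groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
ss_b, ss_w, f_ratio = one_way_anova(toy_groups)
print(ss_b, ss_w, f_ratio)
```

The F ratio is referred to an $F_{k-1,\,n-k}$ distribution, which for the four regions with ten policies each means 3 and 36 degrees of freedom.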
14. (Past Institute Exam) As part of an investigation into health service funding, a working party was concerned with the issue of whether mortality could be used to predict sickness rates. Data on standardised mortality rates and standardised sickness rates were collected for a sample of 10 regions and are shown in the table below:

Region   Mortality rate m (per 100,000)   Sickness rate s (per 100,000)
 1       125.2                            206.8
 2       119.3                            213.8
 3       125.3                            197.2
 4       111.7                            200.6
 5       117.3                            189.1
 6       100.7                            183.6
 7       108.8                            181.2
 8       102.0                            168.2
 9       104.7                            165.2
10       121.1                            228.5

Data summaries: $\sum m = 1136.1$, $\sum m^2 = 129{,}853.03$, $\sum s = 1934.2$, $\sum s^2 = 377{,}700.62$, and $\sum ms = 221{,}022.58$.

a. Calculate the correlation coefficient between the mortality rates and the sickness rates and determine the probability-value for testing whether the underlying correlation coefficient is zero against the alternative that it is positive.
b. Noting the issue under investigation, draw an appropriate scatterplot for these data and comment on the relationship between the two rates.
c. Determine the fitted linear regression of sickness rate on mortality rate and test whether the underlying slope coefficient can be considered to be as large as 2.0.
d. For a region with mortality rate 115.0, estimate the expected sickness rate and calculate 95% confidence limits for this expected rate.
15. (Past Institute Exam) Consider the following data, which comprise four groups of claim sizes ($y$), each comprising four observations. In scenario I, information is also given on the sum assured under the policy concerned; the sum assured is the same for all four policies in a group. In scenario II, we regard the policies in the different groups as having been issued by four different companies; the policies in a group are all issued by the same company.

All monetary amounts are in units of £10,000. Summaries of the claim sizes in each group are given in a second table.

Group               1            2            3            4
Claim sizes y       0.11 0.46    0.52 1.43    1.48 2.05    1.52 2.36
                    0.71 1.45    1.84 2.47    2.38 3.31    2.95 4.08
I: sum assured x    1            2            3            4
II: Company         A            B            C            D

Summaries of claim sizes:

Group        1         2          3          4
$\sum y$     2.73      6.26       9.22       10.91
$\sum y^2$   2.8303    11.8018    23.0134    33.2289

a. In scenario I, suppose we adopt the linear regression model $Y_i = \alpha + \beta x_i + \varepsilon_i$, where $Y_i$ is the $i$th claim size and $x_i$ is the corresponding sum assured, $i = 1, \dots, 16$.
   i. Calculate the total sum of squares and its partition into the regression (model) sum of squares and the residual (error) sum of squares.
   ii. Fit the model and calculate the fitted values for the first claim size of group 1 (namely 0.11) and the last claim size of group 4 (namely 4.08).
   iii. Consider a test of the hypothesis $H_0: \beta = 0$ against a two-sided alternative. By performing appropriate calculations, assess the strength of the evidence against this "no linear relationship" hypothesis.
b. In scenario II, suppose we adopt the analysis of variance model $Y_{ij} = \mu + \tau_i + e_{ij}$, where $Y_{ij}$ is the $j$th claim size for company $i$ and $\tau_i$ is the $i$th company effect, $i = A, B, C, D$ and $j = 1, 2, 3, 4$.
   i. Calculate the partition of the total sum of squares into the "between companies" (model) sum of squares and the "within companies" (residual/error) sum of squares.
   ii. Fit the model.
   iii. Calculate the fitted values for the first claim size of group 1 and the last claim size of group 4.
   iv. Consider a test of the hypothesis $H_0: \tau_i = 0$, $i = A, B, C, D$, against a general alternative. By performing appropriate calculations, assess the strength of the evidence against this "no company effects" hypothesis.
Multiple linear regression questions

1. Describe the null hypotheses to which the $p$-values given in Table 3.4 correspond. Explain what conclusions you can draw based on these $p$-values. Your explanation should be phrased in terms of sales, TV, radio, and newspaper, rather than in terms of the coefficients of the linear model.
2. Suppose we have a data set with five predictors: $X_1 =$ GPA, $X_2 =$ IQ, $X_3 =$ Level (1 for College and 0 for High School), $X_4 =$ Interaction between GPA and IQ, and $X_5 =$ Interaction between GPA and Level. The response is starting salary after graduation (in thousands of dollars). Suppose we use least squares to fit the model, and get $\hat{\beta}_0 = 50$, $\hat{\beta}_1 = 20$, $\hat{\beta}_2 = 0.07$, $\hat{\beta}_3 = 35$, $\hat{\beta}_4 = 0.01$, $\hat{\beta}_5 = -10$.

a. Which answer is correct, and why?
   i. For a fixed value of IQ and GPA, high school graduates earn more, on average, than college graduates.
   ii. For a fixed value of IQ and GPA, college graduates earn more, on average, than high school graduates.
   iii. For a fixed value of IQ and GPA, high school graduates earn more, on average, than college graduates provided that the GPA is high enough.
   iv. For a fixed value of IQ and GPA, college graduates earn more, on average, than high school graduates provided that the GPA is high enough.
b. Predict the salary of a college graduate with IQ of 110 and a GPA of 4.0.
c. True or false: Since the coefficient for the GPA/IQ interaction term is very small, there is very little evidence of an interaction effect. Justify your answer.
3. I collect a set of data ($n = 100$ observations) containing a single predictor and a quantitative response. I then fit a linear regression model to the data, as well as a separate cubic regression, i.e. $Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \beta_3 X^3 + \varepsilon$.

a. Suppose that the true relationship between X and Y is linear, i.e. $Y = \beta_0 + \beta_1 X + \varepsilon$. Consider the training residual sum of squares (RSS) for the linear regression, and also the training RSS for the cubic regression. Would we expect one to be lower than the other, would we expect them to be the same, or is there not enough information to tell? Justify your answer.
b. Answer (a) using test rather than training RSS.
c. Suppose that the true relationship between X and Y is not linear, but we don't know how far it is from linear. Consider the training RSS for the linear regression, and also the training RSS for the cubic regression. Would we expect one to be lower than the other, would we expect them to be the same, or is there not enough information to tell? Justify your answer.
d. Answer (c) using test rather than training RSS.
4. Consider the simple linear regression model in matrix form.

a. Write down the design matrix $X$ for the simple linear regression model.
b. Write out the matrix $X^\top X$ for the simple linear regression model.
c. Write out the matrix $X^\top Y$ for the simple linear regression model.
d. Write out the matrix $(X^\top X)^{-1}$ for the simple linear regression model.
e. Calculate $\hat{\beta} = (X^\top X)^{-1} X^\top Y$ using your results above.

Here $Y$ is the vector of the response variable and $\hat{\beta}$ is the vector of coefficients.
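The matrix expression in Question 4 can be checked numerically once the matrices are written out. The sketch below assumes the design matrix has a column of ones followed by the $x$ values, builds $X^\top X$ and $X^\top Y$ from a hypothetical five-point data set, inverts the $2 \times 2$ matrix by hand, and multiplies through. All variable names and the data are illustrative assumptions.

```python
# Hypothetical toy data (illustrative only)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.1, 5.9, 8.2, 9.8]
n = len(x)

# With X = [1, x_i] row by row, X'X and X'Y reduce to simple sums
xtx = [[n, sum(x)],
       [sum(x), sum(xi ** 2 for xi in x)]]
xty = [sum(y), sum(xi * yi for xi, yi in zip(x, y))]

# Inverse of a 2x2 matrix [[a, b], [c, d]] is (1/det) [[d, -b], [-c, a]]
det = xtx[0][0] * xtx[1][1] - xtx[0][1] * xtx[1][0]
xtx_inv = [[xtx[1][1] / det, -xtx[0][1] / det],
           [-xtx[1][0] / det, xtx[0][0] / det]]

# beta_hat = (X'X)^{-1} X'Y
beta_hat = [xtx_inv[0][0] * xty[0] + xtx_inv[0][1] * xty[1],
            xtx_inv[1][0] * xty[0] + xtx_inv[1][1] * xty[1]]
print(beta_hat)  # [intercept, slope]
```

The result should agree with the scalar formulas $\hat{\beta}_1 = S_{xy}/S_{xx}$ and $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$, which is the point of part (e).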
5. The following model was fitted to a sample of supermarkets in order to explain their profit levels:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \varepsilon$$

where
$y =$ profits, in thousands of dollars,
$x_1 =$ food sales, in tens of thousands of dollars,
$x_2 =$ nonfood sales, in tens of thousands of dollars, and
$x_3 =$ store size, in thousands of square feet.

The estimated regression coefficients are given below:

$\hat{\beta}_1 = 0.027$, $\hat{\beta}_2 = -0.097$ and $\hat{\beta}_3 = 0.525$.

Which of the following is TRUE?
a. A dollar increase in food sales increases profits by 2.7 cents.
b. A 2.7 cent increase in food sales increases profits by a dollar.
c. A 9.7 cent increase in nonfood sales decreases profits by a dollar.
d. A dollar decrease in nonfood sales increases profits by 9.7 cents.
e. An increase in store size by one square foot increases profits by 52.5 cents.
6. In a regression model of three explanatory variables, twenty-five observations were used to calculate the least squares estimates. The total sum of squares and regression sum of squares were found to be 666.98 and 610.48, respectively. Calculate the adjusted coefficient of determination (i.e. adjusted $R^2$).
a. 89.0%
b. 89.4%
c. 89.9%
d. 90.3%
e. 90.5%
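Questions 6 and 8 both rest on the adjusted coefficient of determination, commonly defined as $R^2_{adj} = 1 - \dfrac{SSE/(n-p-1)}{SST/(n-1)}$ with $p$ explanatory variables (not counting the intercept). The helper below is a minimal sketch of that definition; the function name and the example numbers are hypothetical and deliberately differ from the figures in the questions.

```python
def adjusted_r2(sst, ssm, n, p):
    """Adjusted R^2 given total SS, model SS, n observations, p predictors.

    p counts explanatory variables only; the intercept is excluded.
    """
    sse = sst - ssm                  # residual sum of squares
    mse = sse / (n - p - 1)          # error mean square
    mst = sst / (n - 1)              # total mean square
    return 1 - mse / mst


# Hypothetical figures: SST = 100, SSM = 80, n = 20 observations, p = 2 predictors
r2 = 80 / 100
adj = adjusted_r2(100, 80, 20, 2)
print(r2, adj)
```

The adjusted value is never larger than the unadjusted $R^2$, because it penalises the degrees of freedom used by the fitted parameters.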
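7. In a multiple regression model given by:

$$y = \beta_0 + \beta_1 x_1 + \dots + \beta_{p-1} x_{p-1} + \varepsilon,$$

which of the following gives a correct expression for the coefficient of determination (i.e. $R^2$)?

I. $\dfrac{SSM}{SST}$
II. $\dfrac{SST - SSE}{SST}$
III. $\dfrac{SSM}{SSE}$

Options:
a. I only
b. II only
c. III only
d. I and II only
e. I and III only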
8. The ANOVA table output from a multiple regression model is given below:

Source       D.F.   SS        MS       F-Ratio   Prob(>F)
Regression   5      13326.1   2665.2   13.13     0.000
Error        42     8525.3    203.0
Total        47     21851.4

Compute the adjusted coefficient of determination (i.e. adjusted $R^2$).
a. 52%
b. 56%
c. 61%
d. 63%
e. 68%
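9. You have information on 62 purchases of Ford automobiles. In particular, you have the amount paid for the car $y$ in hundreds of dollars, the annual income of the individuals $x_1$ in hundreds of dollars, the sex of the purchaser ($x_2$, 1 = male and 0 = female) and whether or not the purchaser graduated from college ($x_3$, 1 = yes, 0 = no). After examining the data and other information available, you decide to use the regression model:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \varepsilon.$$

You are given that:

$$(X^\top X)^{-1} = \begin{bmatrix} 0.109564 & -0.000115 & -0.035300 & -0.026804 \\ -0.000115 & 0.000001 & -0.000115 & -0.000091 \\ -0.035300 & -0.000115 & 0.102446 & 0.023971 \\ -0.026804 & -0.000091 & 0.023971 & 0.083184 \end{bmatrix}$$

and the mean square error for the model is $s^2 = 30106$. Calculate $SE(\hat{\beta}_2)$.
a. 0.17
b. 17.78
c. 50.04
d. 55.54
e. 57.43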
10. Suppose in addition to the information in Question 9, you are given:

$$X^\top Y = \begin{bmatrix} 9\,558 \\ 4\,880\,937 \\ 7\,396 \\ 6\,552 \end{bmatrix}.$$

Calculate the expected difference in the amount spent to purchase a car between a person who graduated from college and another one who did not.
a. 233.5
b. 1604.3
c. 2195.3
d. 4920.6
e. 6472.1
11. A regression model of $y$ on four independent variables $x_1$, $x_2$, $x_3$ and $x_4$ has been fitted to data consisting of 212 observations, and the computer output from estimating this model is given below:

Regression Analysis

The regression equation is
y = 3894 - 50.3 x1 + 0.0826 x2 + 0.893 x3 + 0.137 x4

Predictor   Coef       SE Coef    T
Constant    3893.8     409.0      9.52
x1          -50.32     9.062      -5.55
x2          0.08258    0.02133    3.87
x3          0.89269    0.04744    18.82
x4          0.13677    0.05303    2.58

Which of the following statements is NOT true?
a. All the explanatory variables have insignificant influence on $y$.
b. The variable $x_1$ is a significant variable.
c. The variable $x_2$ is a significant variable.
d. The variable $x_3$ is a significant variable.
e. The variable $x_4$ is a significant variable.

Here the $x_i$'s are vectors of explanatory variables and $y$ is the vector of the response variable.
12. The estimated regression model of fitting life expectancy from birth (LIFE_EXP) on the country's gross national product (in thousands) per population (GNP) and the percentage of population living in urban areas (URBAN%) is given by:

LIFE_EXP = 48.24 + 0.79 GNP + 0.154 URBAN%.

For a particular country, its URBAN% is 60 and its GNP is 3.0. Calculate the estimated life expectancy at birth for this country.
a. 49
b. 50
c. 57
d. 60
e. 65
13. What is the use of the scatter plot of the fitted values and the residuals?
a. to examine the normal distribution assumption of the errors
b. to examine the goodness of fit of the regression model
c. to examine the constant variation assumption of the errors
d. to test whether the errors have zero mean
e. to examine the independence of the errors
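KNN question

1. Consider a $k$-nearest neighbours model where $Y = f(X) + \varepsilon$, $E(\varepsilon) = 0$, $V(\varepsilon) = \sigma^2$, and the estimated model is $\hat{f}(x)$. The weight function is $\frac{1}{k}$. Show that

$$EPE_k(x_0) = \sigma^2 + \left[ f(x_0) - \frac{1}{k} \sum_{l \in N(x_0)} f(x_{(l)}) \right]^2 + \frac{\sigma^2}{k},$$

where $N(x_0)$ are $x_0$'s $k$ nearest neighbours. Note that:

$$EPE_k(x_0) = E\left[ (Y - \hat{f}(x_0))^2 \mid X = x_0 \right].$$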
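The three terms in the KNN expected prediction error (irreducible noise, squared bias, and variance of the average) can be evaluated directly once $f$, the neighbours and $\sigma^2$ are fixed. The sketch below does so for a hypothetical true function $f(x) = x^2$ and three hypothetical neighbours of $x_0 = 1$; none of these choices come from the lab, and the neighbour locations are treated as fixed.

```python
def knn_epe(f, x0, neighbours, sigma2):
    """Evaluate sigma^2 + bias^2 + sigma^2 / k for fixed neighbour locations."""
    k = len(neighbours)
    mean_f = sum(f(x) for x in neighbours) / k   # expected value of the KNN fit
    bias_sq = (f(x0) - mean_f) ** 2              # squared bias at x0
    variance = sigma2 / k                        # variance of an average of k noisy values
    return sigma2 + bias_sq + variance


def true_f(x):
    # Hypothetical true regression function (an assumption for illustration)
    return x ** 2


epe = knn_epe(true_f, x0=1.0, neighbours=[0.8, 1.0, 1.2], sigma2=0.25)
print(epe)
```

Increasing $k$ shrinks the $\sigma^2 / k$ term but typically pulls in neighbours further from $x_0$, which is the bias-variance trade-off the question is asking you to derive.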
Computer output for the regression of STKPRICE on EPS (simple linear regression Question 12):

Regression Analysis

The regression equation is
STKPRICE = 25.044 + 7.445 EPS

Predictor   Coef     SE Coef   T      p
Constant    25.044   3.326     7.53   0.000
EPS         7.445    1.144     6.51   0.000

Analysis of Variance

SOURCE       DF   SS      MS      F       p
Regression   1    10475   10475   42.35   0.000
Error        46   11377   247
Total        47   21851
ACTL3142 and ACTL5110
Lab 2: Linear Regression
Questions
Conceptual Questions
Simple linear regression questions
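Question 1: $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$ and $\hat{\beta}_1 = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \dfrac{S_{xy}}{S_{xx}}$.

Question 3: $\hat{\beta}_0$ and $\hat{\beta}_1$.

Question 5: a) $\hat{\beta}_1$, b) $\hat{\beta}_1^2$.

Question 6: $V(\hat{\beta}_0 \mid X = x) = \sigma^2 \left( \dfrac{1}{n} + \dfrac{\bar{x}^2}{S_{xx}} \right)$, $V(\hat{\beta}_1 \mid X = x) = \dfrac{\sigma^2}{S_{xx}}$, $\mathrm{Cov}(\hat{\beta}_0, \hat{\beta}_1 \mid X = x) = -\dfrac{\bar{x}\,\sigma^2}{S_{xx}}$.

Question 7: $V(\hat{y}_0) = \left( \dfrac{1}{n} + \dfrac{(\bar{x} - x_0)^2}{S_{xx}} \right) \sigma^2$, with $\hat{y}_0 = \hat{\beta}_0 + \hat{\beta}_1 x_0$ the estimate of $E(y \mid X = x_0)$.

Question 8: $E[\,Y_i - \hat{y}_i \mid X = x, X = x_i\,] = 0$ and $V(Y_i - \hat{y}_i \mid X = x, X = x_i) = \sigma^2 \left( 1 + \dfrac{1}{n} + \dfrac{(\bar{x} - x_i)^2}{S_{xx}} \right)$.

Question 9: interval $x$, concentration $y$; $\sum x = 337$, $\sum x^2 = 9854.5$, $\sum y = 42.98$, $\sum y^2 = 109.7936$, $\sum xy = 672.8$.

Question 10: entrance score $x$, final score $y$; $\sum x = 836$, $\sum y = 867$, $\sum x^2 = 60{,}016$, $\sum y^2 = 63{,}603$, $\sum (x - \bar{x})(y - \bar{y}) = 1{,}122$; regression of $y$ on $x$; error variance $\sigma^2$.

Question 11: $n = 60$; remaining table cells ____, ____, ____ and 639.5.

Question 12: $\bar{x} = 2.338$, $\bar{y} = 40.21$, $s_x = 2.004$, $s_y = 21.56$; $s_x^2 = \dfrac{S_{xx}}{n-1}$ and $s_y^2 = \dfrac{S_{yy}}{n-1}$; slope $\beta$; $s$ and $R^2$; $H_0: \beta = 24$ against $H_a: \beta > 24$.

Question 13: regions $A$, $B$, $C$, $D$; samples of 10; 5% level.
$A$: ($\sum x = 2{,}479$, $\sum x^2 = 617{,}163$)
$B$: ($\sum x = 2{,}619$, $\sum x^2 = 687{,}467$)
$C$: ($\sum x = 2{,}441$, $\sum x^2 = 597{,}607$)
$D$: ($\sum x = 2{,}677$, $\sum x^2 = 718{,}973$)

Question 14: mortality rate $m$, sickness rate $s$; $\sum m = 1136.1$, $\sum m^2 = 129{,}853.03$, $\sum s = 1934.2$, $\sum s^2 = 377{,}700.62$, $\sum ms = 221{,}022.58$.

Question 15: claim size $y$, sum assured $x$, group summaries $\sum y$ and $\sum y^2$; regression model $Y_i = \alpha + \beta x_i + \varepsilon_i$, $i = 1, \dots, 16$, with $H_0: \beta = 0$; analysis of variance model $Y_{ij} = \mu + \tau_i + e_{ij}$ for the $j$th claim size ($j = 1, 2, 3, 4$) of company $i$ ($i = A, B, C, D$), with $H_0: \tau_i = 0$, $i = A, B, C, D$.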
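Multiple linear regression questions

Question 1: $p$-values.

Question 2: $X_1 =$ GPA, $X_2 =$ IQ, $X_3 =$ Level, $X_4 =$ Interaction between GPA and IQ, $X_5 =$ Interaction between GPA and Level; $\hat{\beta}_0 = 50$, $\hat{\beta}_1 = 20$, $\hat{\beta}_2 = 0.07$, $\hat{\beta}_3 = 35$, $\hat{\beta}_4 = 0.01$, $\hat{\beta}_5 = -10$.

Question 3: $n = 100$; cubic regression $Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \beta_3 X^3 + \varepsilon$; linear relationship $Y = \beta_0 + \beta_1 X + \varepsilon$.

Question 4: $X$, $X^\top X$, $X^\top Y$, $(X^\top X)^{-1}$, $\hat{\beta} = (X^\top X)^{-1} X^\top Y$; $Y$ is the response vector and $\hat{\beta}$ the coefficient vector.

Question 5: $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \varepsilon$, with $y =$ profits, $x_1 =$ food sales, $x_2 =$ nonfood sales, $x_3 =$ store size; $\hat{\beta}_1 = 0.027$, $\hat{\beta}_2 = -0.097$ and $\hat{\beta}_3 = 0.525$.

Question 6: total sum of squares 666.98 and regression sum of squares 610.48; adjusted $R^2$.

Question 7: $y = \beta_0 + \beta_1 x_1 + \dots + \beta_{p-1} x_{p-1} + \varepsilon$; $R^2$; I. $\dfrac{SSM}{SST}$, II. $\dfrac{SST - SSE}{SST}$, III. $\dfrac{SSM}{SSE}$.

Question 8: column heading Prob($>F$); adjusted $R^2$.

Question 9: $y$, $x_1$, $x_2$, $x_3$; $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \varepsilon$; $s^2 = 30106$; $SE(\hat{\beta}_2)$;

$$(X^\top X)^{-1} = \begin{bmatrix} 0.109564 & -0.000115 & -0.035300 & -0.026804 \\ -0.000115 & 0.000001 & -0.000115 & -0.000091 \\ -0.035300 & -0.000115 & 0.102446 & 0.023971 \\ -0.026804 & -0.000091 & 0.023971 & 0.083184 \end{bmatrix}$$

Question 10: $X^\top Y = \begin{bmatrix} 9\,558 & 4\,880\,937 & 7\,396 & 6\,552 \end{bmatrix}^\top$.

Question 11: response $y$; explanatory variables $x_1$, $x_2$, $x_3$ and $x_4$; 212 observations.

Question 12: LIFE_EXP = 48.24 + 0.79 GNP + 0.154 URBAN%.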
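KNN question

Question 1: $k$-nearest neighbours; $Y = f(X) + \varepsilon$, $E(\varepsilon) = 0$, $V(\varepsilon) = \sigma^2$; estimated model $\hat{f}(x)$; weight function $\frac{1}{k}$;

$$EPE_k(x_0) = \sigma^2 + \left[ f(x_0) - \frac{1}{k} \sum_{l \in N(x_0)} f(x_{(l)}) \right]^2 + \frac{\sigma^2}{k},$$

where $N(x_0)$ are $x_0$'s $k$ nearest neighbours, and

$$EPE_k(x_0) = E\left[ (Y - \hat{f}(x_0))^2 \mid X = x_0 \right].$$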