# Chapter 12: Estimation II: Methods of Estimation

Summary of the main points. Aris Spanos [Fall 2020]

## 1 Introduction

What is a practitioner supposed to do in statistical modeling?

**Stage 1.** A practitioner begins with a data set $\mathbf{x}_0:=(x_1,x_2,\ldots,x_n)$ and some questions of substantive interest relating to the mechanism that gave rise to $\mathbf{x}_0$.

**Stage 2.** The practitioner studies several data plots (chapters 5-7) and chooses a statistical model $\mathcal{M}_{\boldsymbol{\theta}}(\mathbf{x})$ so as to account for all the chance regularities in $\mathbf{x}_0$, using probabilistic assumptions from three broad categories:

| Distribution | Dependence | Heterogeneity |
|---|---|---|
| Normal | Independence | Identically Distributed |
| Bernoulli | Markov dependence | strict Stationarity |
| $\ldots$ | $\ldots$ | $\ldots$ |

$\mathbf{x}_0$ represents a realization of the sample $\mathbf{X}:=(X_1,X_2,\ldots,X_n)$ from $\mathcal{M}_{\boldsymbol{\theta}}(\mathbf{x})$.
**Stage 3.** The probabilistic assumptions comprising $\mathcal{M}_{\boldsymbol{\theta}}(\mathbf{x})$ determine the distribution of the sample $f(\mathbf{x};\boldsymbol{\theta})$, $\mathbf{x}\in\mathbb{R}_X^n$.

**Stage 4.** Using the Maximum Likelihood (ML) method, the practitioner defines the likelihood function $L(\boldsymbol{\theta};\mathbf{x}_0)\propto f(\mathbf{x}_0;\boldsymbol{\theta})$, $\boldsymbol{\theta}\in\Theta$, which is used to derive the ML estimator of $\theta$, say $\hat{\theta}_{ML}(\mathbf{X})=g(\mathbf{X})$; chapter 12.

**Stage 5.** Since $\hat{\theta}_{ML}(\mathbf{X})=g(\mathbf{X})$ is a specific function $g(\cdot)$ of the sample $\mathbf{X}$, it is a random variable itself and has its own sampling distribution, which can be derived via:
$$F(t;\boldsymbol{\theta})=\mathbb{P}(\hat{\theta}_{ML}(\mathbf{X})\le t;\boldsymbol{\theta})=\underbrace{\int\!\!\int\cdots\int}_{\{\mathbf{x}:\,g(\mathbf{x})\le t,\ \mathbf{x}\in\mathbb{R}_X^n\}} f(\mathbf{x};\boldsymbol{\theta})\,d\mathbf{x}. \tag{1.0.1}$$

**Stage 6.** Using the sampling distribution $f(\hat{\theta}_{ML}(\mathbf{x});\boldsymbol{\theta})$, $\mathbf{x}\in\mathbb{R}_X^n$, the practitioner establishes the properties of $\hat{\theta}_{ML}(\mathbf{X})$ to decide whether it is optimal (unbiased, fully efficient, sufficient, consistent).

**Stage 7.** If $\hat{\theta}_{ML}(\mathbf{X})$ is an 'optimal' estimator of $\theta$, the practitioner proceeds to use it to draw inferences relating to learning from data about $\theta$: constructing Confidence Intervals, Testing Hypotheses and Predicting!
**Warning**: What happens if any of the probabilistic assumptions comprising $\mathcal{M}_{\boldsymbol{\theta}}(\mathbf{x})$ is invalid for data $\mathbf{x}_0$? All the derivations in stages 3-7 are invalidated, and the inferences based on $f(\hat{\theta}_{ML}(\mathbf{x});\boldsymbol{\theta})$, $\mathbf{x}\in\mathbb{R}_X^n$, are unreliable!

**Missing stage 3.5**: test the validity of the model assumptions using Mis-Specification (M-S) testing before stage 4; chapter 15. If any of these probabilistic assumptions are invalid, the practitioner respecifies $\mathcal{M}_{\boldsymbol{\theta}}(\mathbf{x})$ (changes the original assumptions from stage 2) until a statistically adequate model is found.

In chapter 11 we discussed estimators and their properties (table 12.1).

Table 12.1: Properties of Estimators

| Finite sample ($n<\infty$) | Asymptotic ($n\to\infty$) |
|---|---|
| 1. Unbiasedness | 5. Consistency (weak, strong) |
| 2. Relative Efficiency | 6. Asymptotic Normality |
| 3. Full Efficiency | 7. Asymptotic Unbiasedness |
| 4. Sufficiency | 8. Asymptotic Efficiency |

In addition: reparametrization invariance, $\varphi=g(\theta)$.
Chapter 12 discusses estimation methods for deriving estimators with good properties.

Table 12.2: Methods of Estimation

1. The method of Maximum Likelihood (ML)
2. The Least Squares method (LS)
3. The Moment Matching principle (MM)
4. The Parametric Method of Moments (PMM)

## 2 The Maximum Likelihood Method

### 2.1 The Likelihood function

In contrast to the other methods of estimation, Maximum Likelihood (ML) was specifically developed for the modern model-based approach to statistical inference as framed by Fisher (1912; 1922a; 1925b). This approach turns the Karl Pearson procedure, from data to histograms and frequency curves (Appendix 12.A), on its head by viewing the data $\mathbf{x}_0:=(x_1,x_2,\ldots,x_n)$ as a typical realization of the sample $\mathbf{X}:=(X_1,X_2,\ldots,X_n)$ from a prespecified stochastic generating
mechanism, we call a statistical model:
$$\mathcal{M}_{\boldsymbol{\theta}}(\mathbf{x})=\{f(\mathbf{x};\boldsymbol{\theta}),\ \boldsymbol{\theta}\in\Theta\subset\mathbb{R}^m\},\ \mathbf{x}\in\mathbb{R}_X^n. \tag{2.1.2}$$
The probabilistic assumptions comprising $\mathcal{M}_{\boldsymbol{\theta}}(\mathbf{x})$ are encapsulated by the distribution of the sample $f(\mathbf{x};\boldsymbol{\theta})$, $\mathbf{x}\in\mathbb{R}_X^n$: the joint distribution of the random variables making up the sample. The cornerstone of the ML method is the concept of the **likelihood function** (Fisher, 1921), defined by:
$$L(\boldsymbol{\theta};\mathbf{x}_0)\propto f(\mathbf{x}_0;\boldsymbol{\theta}),\ \boldsymbol{\theta}\in\Theta,$$
where $\propto$ reads 'proportional to'. In light of viewing the statistical model as the stochastic mechanism that generated $\mathbf{x}_0:=(x_1,x_2,\ldots,x_n)$, it seems intuitively obvious to evaluate $f(\mathbf{x};\boldsymbol{\theta})$ at $\mathbf{X}=\mathbf{x}_0$ and pose the reverse question:

▶ how likely does $f(\mathbf{x}_0;\boldsymbol{\theta})$ render the different values of $\boldsymbol{\theta}$ in $\Theta$ to have been the 'true' value $\boldsymbol{\theta}^*$?

Recall that '$\boldsymbol{\theta}^*$ denotes the true value of $\boldsymbol{\theta}$' is a shorthand for saying that 'data $\mathbf{x}_0$ constitute a typical realization of the sample $\mathbf{X}$ with distribution $f(\mathbf{x};\boldsymbol{\theta}^*)$, $\mathbf{x}\in\mathbb{R}_X^n$', and the primary objective of an estimator $\hat{\boldsymbol{\theta}}(\mathbf{X})$ of $\boldsymbol{\theta}$ is to pin-point $\boldsymbol{\theta}^*$. Hence, the likelihood function yields the likelihood (proportional
to the probability) of getting $\mathbf{x}_0$ under different values of $\boldsymbol{\theta}$. In contrast to $f(\mathbf{x};\boldsymbol{\theta})$, $\mathbf{x}\in\mathbb{R}_X^n$, the LF has domain $\Theta$ and range $[0,\infty)$:
$$L(\cdot\,;\mathbf{x}_0):\Theta\to[0,\infty).$$
It reflects the relative likelihoods of different values of $\boldsymbol{\theta}\in\Theta$ stemming from data $\mathbf{x}_0$ when viewed through the prism of $\mathcal{M}_{\boldsymbol{\theta}}(\mathbf{x})$, $\mathbf{x}\in\mathbb{R}_X^n$.

Table 12.3: The frequentist approach to statistical inference

Statistical model $\mathcal{M}_{\boldsymbol{\theta}}(\mathbf{x})=\{f(\mathbf{x};\boldsymbol{\theta}),\ \boldsymbol{\theta}\in\Theta\}$, $\mathbf{x}\in\mathbb{R}_X^n$ $\Rightarrow$ Distribution of the sample $f(\mathbf{x};\boldsymbol{\theta})$, $\mathbf{x}\in\mathbb{R}_X^n$; Data $\mathbf{x}_0:=(x_1,x_2,\ldots,x_n)$ $\longrightarrow$ Likelihood function $L(\boldsymbol{\theta};\mathbf{x}_0)$, $\boldsymbol{\theta}\in\Theta$.
**Example 12.1.** Consider the simple Bernoulli model, as specified in table 12.4.

Table 12.4: The simple Bernoulli model

| | |
|---|---|
| Statistical GM: | $X_t=\theta+u_t$, $t\in\mathbb{N}:=(1,2,\ldots,n,\ldots)$ |
| [1] Bernoulli: | $X_t\sim\mathrm{Ber}(\theta,\theta(1-\theta))$, $x_t\in\{0,1\}$ |
| [2] Constant mean: | $E(X_t)=\theta$, $0\le\theta\le1$, for all $t\in\mathbb{N}$ |
| [3] Constant variance: | $Var(X_t)=\theta(1-\theta)$ for all $t\in\mathbb{N}$ |
| [4] Independence: | $\{X_t,\ t\in\mathbb{N}\}$ an independent process |

Assumptions [1]-[4] imply that $f(\mathbf{x};\theta)$, $\mathbf{x}\in\mathbb{R}_X^n$, takes the form:
$$f(\mathbf{x};\theta)\overset{\text{[4]}}{=}\prod_{t=1}^n f_t(x_t;\theta)\overset{\text{[2]-[3]}}{=}\prod_{t=1}^n f(x_t;\theta)\overset{\text{[1]}}{=}\prod_{t=1}^n\theta^{x_t}(1-\theta)^{1-x_t}=\theta^{\sum_{t=1}^n x_t}(1-\theta)^{\sum_{t=1}^n(1-x_t)},\ \mathbf{x}\in\{0,1\}^n, \tag{2.1.3}$$
where the reduction in (2.1.3) follows from the cumulative imposition of the assumptions [1]-[4]. Hence, the Likelihood Function (LF) takes the form:
$$L(\theta;\mathbf{x}_0)\propto\theta^{\sum_{t=1}^n x_t}(1-\theta)^{\sum_{t=1}^n(1-x_t)}=\theta^{y}(1-\theta)^{(n-y)},\ \theta\in[0,1], \tag{2.1.4}$$
where $Y=\sum_{t=1}^n X_t\sim\mathrm{Bin}(n\theta,\,n\theta(1-\theta);n)$ (chapter 11), i.e. $Y$ is Binomially distributed, and constitutes a minimal sufficient statistic for $\theta$, which means that no information in $f(\mathbf{x};\theta)$ relevant for $\theta$ is lost by replacing $\mathbf{X}$ with $Y$. That is, the distribution $f(y;\theta)$, $y=1,2,\ldots,n$, in figure 12.1 (for $n=100$, $\theta=.56$) is a one-dimensional representation of $f(\mathbf{x};\theta)$, $\mathbf{x}\in\{0,1\}^n$, which is discrete, but the LF $L(\theta;\mathbf{x}_0)\propto\theta^{y}(1-\theta)^{n-y}$, $\theta\in[0,1]$ (after rescaling), in fig. 12.2 is a continuous and differentiable function of $\theta\in[0,1]$.

[Fig. 12.1: $f(y;\theta)$, $y=1,2,\ldots,n$. Fig. 12.2: $L(\theta;\mathbf{x}_0)$, $\theta\in[0,1]$.]

Scaling of the likelihood function is needed since:
$$L(\theta;\mathbf{x}_0)=c(\mathbf{x}_0)\cdot f(\mathbf{x}_0;\theta),\ \theta\in\Theta, \tag{2.1.5}$$
where $c(\mathbf{x}_0)$ depends only on the sample realization $\mathbf{x}_0$ and not on $\theta$.

**Example 12.2.** For the simple Bernoulli model (table 12.4), assume that in a sample of size $n=20$, the observed $y$ is $y=\sum_{t=1}^n x_t=17$. Let us compare two values of $\theta=\mathbb{P}(X=1)$ within the interval $[0,1]$: $\theta=.66$ and $\theta=.9$. In view of $y=17$, the LF will assign a much higher likelihood to $\theta=.9$ [$L(.9;\mathbf{x}_0)=(.9)^{17}(1-.9)^{3}$] than to $\theta=.66$ [$L(.66;\mathbf{x}_0)=(.66)^{17}(1-.66)^{3}$]; see fig. 12.2 for the general shape. Note that due to the presence of the arbitrary constant $c(\mathbf{x}_0)$ in (2.1.5), the only meaningful measure of relative likelihood comes in the form of the ratio:
$$\frac{L(.9;\mathbf{x}_0)}{L(.66;\mathbf{x}_0)}=\frac{(.9)^{17}(.1)^{3}}{(.66)^{17}(.34)^{3}}=4.959,$$
which, in this case, renders the value $\theta=.9$ almost 5 times likelier than $\theta=.66$. Does that mean that this provides evidence that $\theta=.9$ is close to the true $\theta^*$? No! The likelihood values are dominated by the Maximum Likelihood (ML) estimate $\hat{\theta}=17/20=.85$.

**Maximum Likelihood method and learning from data**:
$$\mathbb{P}\Big(\lim_{n\to\infty}\Big[\tfrac{1}{n}\ln\Big(\tfrac{L_n(\theta^*;\mathbf{x})}{L_n(\theta;\mathbf{x})}\Big)\Big]>0\Big)=1,\ \text{for all}\ \theta\in\Theta-\{\theta^*\}. \tag{2.1.6}$$
This result follows directly from applying the SLLN to $\frac{1}{n}\sum_{t=1}^n\ln f(X_t;\theta)$.
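The likelihood-ratio computation in example 12.2 can be replicated in a few lines; a minimal sketch in plain Python ($n=20$, $y=17$ are taken from the example, while the function name is ours):

```python
import math

def bernoulli_log_lik(theta, y, n):
    """ln L(theta; x0) = y*ln(theta) + (n - y)*ln(1 - theta), up to c(x0)."""
    return y * math.log(theta) + (n - y) * math.log(1 - theta)

n, y = 20, 17                          # sample size and observed sum from example 12.2
ratio = math.exp(bernoulli_log_lik(0.9, y, n) - bernoulli_log_lik(0.66, y, n))
# ratio comes out close to the 4.959 reported in the text

# neither value beats the ML estimate theta-hat = y/n = .85
mle = y / n
assert bernoulli_log_lik(mle, y, n) > bernoulli_log_lik(0.9, y, n)
assert bernoulli_log_lik(mle, y, n) > bernoulli_log_lik(0.66, y, n)
```

Working with log-likelihood differences avoids the tiny products $\theta^{17}(1-\theta)^3$ underflowing for larger $n$.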
### 2.2 Maximum Likelihood estimators

Estimating $\theta$ by maximum likelihood amounts to finding that particular value $\hat{\theta}=g(\mathbf{x})$ that maximizes the likelihood function:
$$L(\hat{\theta};\mathbf{x}_0)=\max_{\theta\in\Theta}L(\theta;\mathbf{x}_0)\iff\hat{\theta}=\arg\max_{\theta\in\Theta}L(\theta;\mathbf{x}_0), \tag{2.2.7}$$
and then turning it into a statistic (a function of $\mathbf{X}$). That is, $\hat{\theta}_{ML}(\mathbf{X})=g(\mathbf{X})$ is the Maximum Likelihood Estimator (MLE) of $\theta$, and $\hat{\theta}_{ML}(\mathbf{x}_0)=g(\mathbf{x}_0)$ is the ML estimate. There are several things to note about the MLE in (2.2.7):
(a) the MLE $\hat{\theta}_{ML}(\mathbf{X})$ may not exist,
(b) the MLE $\hat{\theta}_{ML}(\mathbf{X})$ may not be unique,
(c) the MLE may not have a closed-form expression $\hat{\theta}_{ML}(\mathbf{X})=g(\mathbf{X})$.

**Example 12.3.** Consider the simple Uniform model:
$$X_t\sim\mathrm{U}_{\mathrm{IID}}\big(\theta-\tfrac{1}{2},\,\theta+\tfrac{1}{2}\big),\ \theta\in\mathbb{R},\ t=1,2,\ldots,n,$$
where $f(x_t;\theta)=1$, $x_t\in[\theta-\tfrac{1}{2},\theta+\tfrac{1}{2}]$, and:
$$f(\mathbf{x};\theta)=\prod_{t=1}^n 1=1,\quad x_t\in[\theta-\tfrac{1}{2},\theta+\tfrac{1}{2}].$$
The likelihood function is: $L(\theta;\mathbf{x})=1$ if $x_{[1]}\ge\theta-\tfrac{1}{2}$ and $x_{[n]}\le\theta+\tfrac{1}{2}$, where $x_{[1]}$ and $x_{[n]}$ denote the smallest and largest observations. The preferred ML estimator is the midrange $\hat{\theta}_{ML}(\mathbf{X})=\tfrac{1}{2}(X_{[1]}+X_{[n]})$:
$$E(\hat{\theta}_{ML}(\mathbf{X}))=\theta,\ Var(\hat{\theta}_{ML}(\mathbf{X}))=\tfrac{1}{2(n+1)(n+2)}\quad\text{vs.}\quad E(\bar{X}_n)=\theta,\ Var(\bar{X}_n)=\tfrac{1}{12n}.$$
(a) $\hat{\theta}_{ML}(\mathbf{X})$ is non-unique, since the statistical model is non-regular: the support of $f(\mathbf{x};\theta)$ depends on $\theta$.
(b) $\hat{\theta}_{ML}(\mathbf{X})$ is relatively more efficient than $\bar{X}_n=\frac{1}{n}\sum_{t=1}^n X_t$.

In practice $\hat{\theta}_{ML}(\mathbf{X})$ exists and is unique in the overwhelming number of cases of interest, when two additional restrictions to R1-R4 in table 11.4 are imposed on $\mathcal{M}_{\boldsymbol{\theta}}(\mathbf{x})$ (table 12.5).

Table 12.5: Regularity for $\mathcal{M}_{\boldsymbol{\theta}}(\mathbf{x})=\{f(\mathbf{x};\boldsymbol{\theta}),\ \boldsymbol{\theta}\in\Theta\}$, $\mathbf{x}\in\mathbb{R}_X^n$
(R5) $L(\cdot\,;\mathbf{x}_0):\Theta\to[0,\infty)$ is continuous at all points $\boldsymbol{\theta}\in\Theta$,
(R6) for all values $\boldsymbol{\theta}_1\ne\boldsymbol{\theta}_2$ in $\Theta$, $f(\mathbf{x};\boldsymbol{\theta}_1)\ne f(\mathbf{x};\boldsymbol{\theta}_2)$, $\mathbf{x}\in\mathbb{R}_X^n$.
Condition (R5) ensures that $L(\theta;\mathbf{x})$ is smooth enough to locate its maximum, and (R6) ensures that $\theta^*$ is identifiable and thus unique. When the LF is also differentiable, one can locate the maximum by solving the first-order condition:
$$\frac{dL(\theta;\mathbf{x})}{d\theta}\Big|_{\theta=\hat{\theta}}=0,\ \text{given that}\ \frac{d^2L(\theta;\mathbf{x})}{d\theta^2}\Big|_{\theta=\hat{\theta}}<0.$$
In practice, it is often easier to maximize the log-likelihood function instead, because the two have the same maximum (the logarithm is a monotonic transformation):
$$\frac{d\ln L(\theta;\mathbf{x})}{d\theta}=\Big(\frac{1}{L}\Big)\frac{dL(\theta;\mathbf{x})}{d\theta}=0,\ \text{given}\ L\ne0.$$

**Example 12.4.** For the simple Bernoulli model (table 12.4):
$$\ln L(\theta;\mathbf{x})=\Big(\sum_{t=1}^n x_t\Big)\ln\theta+\Big(\sum_{t=1}^n[1-x_t]\Big)\ln(1-\theta)=y\ln\theta+(n-y)\ln(1-\theta), \tag{2.2.8}$$
where $y=\sum_{t=1}^n x_t$. Solving the first-order condition:
$$\frac{d\ln L(\theta;\mathbf{x})}{d\theta}=\frac{y}{\theta}-\frac{n-y}{1-\theta}=0\ \Rightarrow\ y(1-\theta)=\theta(n-y)\ \Rightarrow\ \hat{\theta}_{ML}=\frac{1}{n}\sum_{t=1}^n X_t.$$
$\hat{\theta}_{ML}$ is a maximum of $\ln L(\theta;\mathbf{x})$ when $\frac{d^2\ln L(\theta;\mathbf{x})}{d\theta^2}\big|_{\theta=\hat{\theta}}<0$:
$$\frac{d^2\ln L(\theta;\mathbf{x})}{d\theta^2}\Big|_{\theta=\hat{\theta}}=\Big(-\frac{y}{\theta^2}-\frac{n-y}{(1-\theta)^2}\Big)\Big|_{\theta=\hat{\theta}}=-\frac{n^3}{y(n-y)}<0,$$
because both the numerator ($n^3$) and the denominator ($y(n-y)$, for $0<y<n$) are positive.

**Example 12.5.** Consider the simple Laplace model (table 12.6), whose density function is:
$$f(x;\theta)=\tfrac{1}{2}\exp\{-|x-\theta|\},\ x\in\mathbb{R},\ \theta\in\mathbb{R}.$$

Table 12.6: The simple Laplace model

| | |
|---|---|
| Statistical GM: | $X_t=\theta+u_t$, $t\in\mathbb{N}:=(1,2,\ldots,n,\ldots)$ |
| [1] Laplace: | $X_t\sim\mathrm{Lap}(\theta,2)$ |
| [2] Constant mean: | $E(X_t)=\theta$ for all $t\in\mathbb{N}$ |
| [3] Constant variance: | $Var(X_t)=2$ for all $t\in\mathbb{N}$ |
| [4] Independence: | $\{X_t,\ t\in\mathbb{N}\}$ an independent process |

The distribution of the sample takes the form:
$$f(\mathbf{x};\theta)=\prod_{t=1}^n\tfrac{1}{2}\exp\{-|x_t-\theta|\}=\big(\tfrac{1}{2}\big)^n\exp\Big\{-\sum_{t=1}^n|x_t-\theta|\Big\},\ \mathbf{x}\in\mathbb{R}^n,$$
and thus the log-likelihood function is:
$$\ln L(\theta;\mathbf{x})=-n\ln(2)-\sum_{t=1}^n|x_t-\theta|,\ \theta\in\mathbb{R}.$$
Since $\ln L(\theta;\mathbf{x})$ is non-differentiable (at the points $\theta=x_t$), one needs to use alternative methods to derive the maximum of this function. In this case, maximizing $\ln L(\theta;\mathbf{x})$ with respect to $\theta$ is equivalent to minimizing the function:
$$\psi(\theta)=\sum_{t=1}^n|x_t-\theta|,$$
which (in the case of $n$ odd) gives rise to the sample median: $\hat{\theta}_{ML}=\mathrm{median}(X_1,X_2,\ldots,X_n)$.

### 2.3 The Score function

The quantity $\frac{d}{d\theta}\ln L(\theta;\mathbf{x})$ has been encountered in chapter 11 in relation to full efficiency, but at that point we used the log of the distribution of the sample, $\ln f(\mathbf{x};\theta)$, instead of $\ln L(\theta;\mathbf{x})$, to define the Fisher information:
$$\mathcal{I}_n(\theta):=E\Big\{\Big(\frac{d\ln f(\mathbf{x};\theta)}{d\theta}\Big)^2\Big\}. \tag{2.3.9}$$
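Returning briefly to example 12.5: the claim that minimizing $\psi(\theta)=\sum_{t=1}^n|x_t-\theta|$ yields the sample median can be checked numerically; a brute-force sketch on hypothetical data (the values in `xs` are invented for illustration):

```python
import statistics

def psi(theta, xs):
    """psi(theta) = sum of |x_t - theta|; minimizing it maximizes the Laplace log-likelihood."""
    return sum(abs(x - theta) for x in xs)

xs = [2.1, -0.4, 3.7, 1.2, 0.8]             # hypothetical data, n odd
med = statistics.median(xs)                  # 1.2 for this sample

grid = [i / 100 for i in range(-200, 500)]   # crude grid of candidate theta values
best = min(grid, key=lambda t: psi(t, xs))
assert abs(best - med) < 0.01                # the grid minimum sits at the sample median
```

A grid search is used deliberately here, since $\psi(\theta)$ is piecewise linear and not amenable to derivative-based maximization.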
In terms of the log-likelihood function, the Cramér-Rao (C-R) lower bound takes the form:
$$Var(\tilde{\theta})\ge\Big[E\Big\{\Big(\frac{d\ln L(\theta;\mathbf{x})}{d\theta}\Big)^2\Big\}\Big]^{-1} \tag{2.3.10}$$
for any unbiased estimator $\tilde{\theta}$ of $\theta$.

**A short digression.** From a mathematical perspective:
$$E\Big\{\Big(\frac{d\ln f(\mathbf{x};\theta)}{d\theta}\Big)^2\Big\}=E\Big\{\Big(\frac{d\ln L(\theta;\mathbf{x})}{d\theta}\Big)^2\Big\},$$
but the question is which choice between $\ln f(\mathbf{x};\theta)$ and $\ln L(\theta;\mathbf{x})$ provides a correct way to express the C-R bound in a probabilistically meaningful way. Neither of these concepts is entirely correct. What is implicitly assumed in the derivation of the C-R bound is a more general real-valued function with two arguments:
$$g(\mathbf{x},\theta):(\mathbb{R}_X^n\times\Theta)\to\mathbb{R},\ \text{such that:}$$
(i) for a given $\mathbf{x}=\mathbf{x}_0$, $g(\mathbf{x}_0;\theta)\propto L(\theta;\mathbf{x}_0)$, $\theta\in\Theta$, and
(ii) for a fixed $\theta$, say $\theta=\theta^*$, $g(\mathbf{x};\theta^*)=f(\mathbf{x};\theta^*)$, $\mathbf{x}\in\mathbb{R}_X^n$.

The first derivative of the log-likelihood function, when interpreted as a function
of the sample $\mathbf{X}$, defines the **score function**:
$$s(\theta;\mathbf{x})=\frac{d}{d\theta}\ln L(\theta;\mathbf{x}),\ \mathbf{x}\in\mathbb{R}_X^n,$$
which satisfies the properties in table 12.7.

Table 12.7: Score function: Properties
(Sc1) $E[s(\theta;\mathbf{X})]=0$,
(Sc2) $Var[s(\theta;\mathbf{X})]=E[s(\theta;\mathbf{X})]^2=-E\Big(\frac{d^2}{d\theta^2}\ln L(\theta;\mathbf{X})\Big):=\mathcal{I}_n(\theta)$.

That is, the Fisher information is the variance of the score function. As shown in the previous chapter, an unbiased estimator $\tilde{\theta}(\mathbf{X})$ of $\theta$ achieves the Cramér-Rao (C-R) lower bound if and only if $(\tilde{\theta}(\mathbf{X})-\theta)$ can be expressed in the form:
$$(\tilde{\theta}(\mathbf{X})-\theta)=h(\theta)\cdot s(\theta;\mathbf{X})\ \text{for some function}\ h(\theta).$$

**Example 12.6.** In the case of the Bernoulli model, the score function is:
$$s(\theta;\mathbf{X}):=\frac{d}{d\theta}\ln L(\theta;\mathbf{X})=\frac{Y}{\theta}-\frac{n-Y}{1-\theta}=\frac{Y-n\theta}{\theta(1-\theta)},$$
$$\Big[\frac{\theta(1-\theta)}{n}\Big]s(\theta;\mathbf{X})=\frac{1}{n}(Y-n\theta)=(\hat{\theta}_{ML}-\theta)\ \Rightarrow\ (\hat{\theta}_{ML}-\theta)=\Big[\frac{\theta(1-\theta)}{n}\Big]s(\theta;\mathbf{X}),$$
which implies that $\hat{\theta}_{ML}=\frac{1}{n}\sum_{t=1}^n X_t$ achieves the C-R lower bound:
$$Var(\hat{\theta}_{ML})=\text{C-R}(\theta)=\frac{\theta(1-\theta)}{n},$$
confirming the result in example 11.15.

**Example 12.7.** Consider the simple Exponential model in table 12.8.

Table 12.8: The simple Exponential model

| | |
|---|---|
| Statistical GM: | $X_t=\theta+u_t$, $t\in\mathbb{N}:=(1,2,\ldots,n,\ldots)$ |
| [1] Exponential: | $X_t\sim\mathrm{Exp}(\theta,\theta^2)$, $x_t\in\mathbb{R}_+$ |
| [2] Constant mean: | $E(X_t)=\theta$, $\theta\in\mathbb{R}_+$, for all $t\in\mathbb{N}$ |
| [3] Constant variance: | $Var(X_t)=\theta^2$ for all $t\in\mathbb{N}$ |
| [4] Independence: | $\{X_t,\ t\in\mathbb{N}\}$ an independent process |
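Properties (Sc1)-(Sc2) of table 12.7 can be verified exactly for the Bernoulli case of example 12.6, since $Y=\sum_t X_t$ takes only $n+1$ values; a small sketch (the choices $n=10$, $\theta=0.3$ are arbitrary):

```python
import math

n, theta = 10, 0.3                         # arbitrary illustrative values

def score(y):
    """Bernoulli score s(theta; x) = Y/theta - (n - Y)/(1 - theta)."""
    return y / theta - (n - y) / (1 - theta)

def binom_pmf(y):
    """P(Y = y) for Y ~ Bin(n, theta)."""
    return math.comb(n, y) * theta**y * (1 - theta)**(n - y)

mean_s = sum(binom_pmf(y) * score(y) for y in range(n + 1))
var_s = sum(binom_pmf(y) * score(y) ** 2 for y in range(n + 1))
fisher = n / (theta * (1 - theta))         # I_n(theta) for the Bernoulli model

assert abs(mean_s) < 1e-9                  # (Sc1): E[s(theta; X)] = 0
assert abs(var_s - fisher) < 1e-6          # (Sc2): Var[s(theta; X)] = I_n(theta)
```

Exact enumeration over the sufficient statistic sidesteps simulation noise entirely, which is why the tolerances can be so tight.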
Assumptions [1]-[4] imply that $f(\mathbf{x};\theta)$, $\mathbf{x}\in\mathbb{R}_+^n$, takes the form:
$$f(\mathbf{x};\theta)\overset{\text{[4]}}{=}\prod_{t=1}^n f_t(x_t;\theta)\overset{\text{[2]-[3]}}{=}\prod_{t=1}^n f(x_t;\theta)\overset{\text{[1]}}{=}\prod_{t=1}^n\tfrac{1}{\theta}\exp\big\{-\tfrac{x_t}{\theta}\big\}=\big(\tfrac{1}{\theta}\big)^n\exp\Big\{-\tfrac{1}{\theta}\sum_{t=1}^n x_t\Big\},\ \mathbf{x}\in\mathbb{R}_+^n,$$
$$\ln L(\theta;\mathbf{x})=-n\ln(\theta)-\frac{1}{\theta}\sum_{t=1}^n x_t,\quad\frac{d\ln L(\theta;\mathbf{x})}{d\theta}=-\frac{n}{\theta}+\frac{1}{\theta^2}\sum_{t=1}^n x_t=0\ \Rightarrow\ \hat{\theta}_{ML}=\frac{1}{n}\sum_{t=1}^n X_t.$$
The second-order condition:
$$\frac{d^2\ln L(\theta;\mathbf{x})}{d\theta^2}\Big|_{\theta=\hat{\theta}}=\Big(\frac{n}{\theta^2}-\frac{2}{\theta^3}\sum_{t=1}^n x_t\Big)\Big|_{\theta=\hat{\theta}}=-\frac{n}{\hat{\theta}^2}<0$$
ensures that $\ln L(\hat{\theta};\mathbf{x})$ is a maximum and not a minimum or a point of inflection. Using the second derivative of the log-likelihood function we can derive the Fisher information:
$$\mathcal{I}_n(\theta):=-E\Big(\frac{d^2\ln L(\theta;\mathbf{x})}{d\theta^2}\Big)=\frac{n}{\theta^2}.$$
The above results suggest that the ML estimator $\hat{\theta}_{ML}=\frac{1}{n}\sum_{t=1}^n X_t$ is both unbiased and fully efficient (verify!).
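As a quick numerical sanity check of example 12.7 (the data in `xs` are hypothetical), the log-likelihood drops whenever we move away from $\hat{\theta}=\bar{x}_n$:

```python
import math
import statistics

xs = [0.5, 1.7, 2.4, 0.9, 1.1]              # hypothetical positive data

def log_lik_exp(theta, xs):
    """ln L(theta; x) = -n*ln(theta) - (1/theta)*sum(x_t) for the simple Exponential model."""
    return -len(xs) * math.log(theta) - sum(xs) / theta

mle = statistics.mean(xs)                    # theta-hat = sample mean
for eps in (0.01, 0.1, 0.5):                 # perturb in both directions
    assert log_lik_exp(mle, xs) > log_lik_exp(mle + eps, xs)
    assert log_lik_exp(mle, xs) > log_lik_exp(mle - eps, xs)
```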
### 2.4 Two-parameter statistical model

In the case where $\boldsymbol{\theta}$ contains more than one parameter, say $\boldsymbol{\theta}:=(\theta_1,\theta_2)$, the first-order conditions for the MLEs take the form of a system of equations:
$$\frac{\partial\ln L(\boldsymbol{\theta};\mathbf{x})}{\partial\theta_1}=0,\quad\frac{\partial\ln L(\boldsymbol{\theta};\mathbf{x})}{\partial\theta_2}=0,$$
which need to be solved simultaneously in order to derive the MLEs $\hat{\boldsymbol{\theta}}_{ML}(\mathbf{X})$. Moreover, the second-order conditions for a maximum are more involved than in the one-parameter case, since they involve three restrictions:
$$\text{(i)}\ \det\begin{pmatrix}\frac{\partial^2\ln L}{\partial\theta_1^2}&\frac{\partial^2\ln L}{\partial\theta_1\partial\theta_2}\\[2pt]\frac{\partial^2\ln L}{\partial\theta_2\partial\theta_1}&\frac{\partial^2\ln L}{\partial\theta_2^2}\end{pmatrix}\Bigg|_{\hat{\boldsymbol{\theta}}_{ML}}>0,\quad\text{(ii)}\ \frac{\partial^2\ln L}{\partial\theta_1^2}\Big|_{\hat{\boldsymbol{\theta}}_{ML}}<0,\quad\text{(iii)}\ \frac{\partial^2\ln L}{\partial\theta_2^2}\Big|_{\hat{\boldsymbol{\theta}}_{ML}}<0.$$
Note that when (ii) and (iii) are positive (with (i) still positive), the optimum is a minimum. The Fisher information matrix is defined by:
$$\mathcal{I}_n(\boldsymbol{\theta})=E\Big(\frac{\partial\ln L(\boldsymbol{\theta};\mathbf{x})}{\partial\boldsymbol{\theta}}\,\frac{\partial\ln L(\boldsymbol{\theta};\mathbf{x})}{\partial\boldsymbol{\theta}^{\top}}\Big)=-E\Big(\frac{\partial^2\ln L(\boldsymbol{\theta};\mathbf{x})}{\partial\boldsymbol{\theta}\,\partial\boldsymbol{\theta}^{\top}}\Big)=Var\Big(\frac{\partial\ln L(\boldsymbol{\theta};\mathbf{x})}{\partial\boldsymbol{\theta}}\Big).$$
**Example 12.8.** Consider the simple Normal model in table 12.9.

Table 12.9: The simple Normal model

| | |
|---|---|
| Statistical GM: | $X_t=\mu+u_t$, $t\in\mathbb{N}:=(1,2,\ldots,n,\ldots)$ |
| [1] Normal: | $X_t\sim\mathrm{N}(\mu,\sigma^2)$, $x_t\in\mathbb{R}$ |
| [2] Constant mean: | $E(X_t)=\mu$, $\mu\in\mathbb{R}$, for all $t\in\mathbb{N}$ |
| [3] Constant variance: | $Var(X_t)=\sigma^2$ for all $t\in\mathbb{N}$ |
| [4] Independence: | $\{X_t,\ t\in\mathbb{N}\}$ an independent process |

Assumptions [1]-[4] imply that $f(\mathbf{x};\boldsymbol{\theta})$, $\mathbf{x}\in\mathbb{R}^n$, takes the form:
$$f(\mathbf{x};\boldsymbol{\theta})=\prod_{t=1}^n\frac{1}{\sqrt{2\pi\sigma^2}}\exp\Big(-\frac{(x_t-\mu)^2}{2\sigma^2}\Big)=\Big(\frac{1}{\sqrt{2\pi\sigma^2}}\Big)^n\exp\Big\{-\frac{1}{2\sigma^2}\sum_{t=1}^n(x_t-\mu)^2\Big\},\ \mathbf{x}\in\mathbb{R}^n. \tag{2.4.11}$$
Hence, the log-likelihood function is:
$$\ln L(\mu,\sigma^2;\mathbf{x})=\text{const.}-\frac{n}{2}\ln\sigma^2-\frac{1}{2\sigma^2}\sum_{t=1}^n(x_t-\mu)^2.$$
Hence, we can derive the MLEs of $\mu$ and $\sigma^2$ via the first-order conditions:
$$\frac{\partial\ln L(\boldsymbol{\theta};\mathbf{x})}{\partial\mu}=-\frac{1}{2\sigma^2}(-2)\sum_{t=1}^n(x_t-\mu)=0,\quad\frac{\partial\ln L(\boldsymbol{\theta};\mathbf{x})}{\partial\sigma^2}=-\frac{n}{2\sigma^2}+\frac{1}{2\sigma^4}\sum_{t=1}^n(x_t-\mu)^2=0.$$
Solving these for $\mu$ and $\sigma^2$ yields:
$$\hat{\mu}_{ML}=\frac{1}{n}\sum_{t=1}^n X_t,\quad\hat{\sigma}^2_{ML}=\frac{1}{n}\sum_{t=1}^n(X_t-\hat{\mu}_{ML})^2.$$
Again, the MLEs coincide with the estimators suggested by the other three methods. $\ln L(\hat{\boldsymbol{\theta}};\mathbf{x})$ for $\hat{\boldsymbol{\theta}}:=(\hat{\mu}_{ML},\hat{\sigma}^2_{ML})$ is indeed a maximum, since the second derivatives at $\boldsymbol{\theta}=\hat{\boldsymbol{\theta}}_{ML}$ take the following signs:
$$\frac{\partial^2\ln L(\boldsymbol{\theta};\mathbf{x})}{\partial\mu^2}\Big|_{\hat{\boldsymbol{\theta}}_{ML}}=\Big(-\frac{n}{\sigma^2}\Big)\Big|_{\hat{\boldsymbol{\theta}}_{ML}}=-\frac{n}{\hat{\sigma}^2}<0,\quad\frac{\partial^2\ln L(\boldsymbol{\theta};\mathbf{x})}{\partial\mu\,\partial\sigma^2}\Big|_{\hat{\boldsymbol{\theta}}_{ML}}=-\frac{1}{\sigma^4}\sum_{t=1}^n(x_t-\mu)\Big|_{\hat{\boldsymbol{\theta}}_{ML}}=0,$$
$$\frac{\partial^2\ln L(\boldsymbol{\theta};\mathbf{x})}{\partial(\sigma^2)^2}\Big|_{\hat{\boldsymbol{\theta}}_{ML}}=\Big(\frac{n}{2\sigma^4}-\frac{1}{\sigma^6}\sum_{t=1}^n(x_t-\mu)^2\Big)\Big|_{\hat{\boldsymbol{\theta}}_{ML}}=-\frac{n}{2\hat{\sigma}^4}<0,$$
$$\Big(\frac{\partial^2\ln L}{\partial\mu^2}\Big)\Big(\frac{\partial^2\ln L}{\partial(\sigma^2)^2}\Big)-\Big(\frac{\partial^2\ln L}{\partial\mu\,\partial\sigma^2}\Big)^2\,\Big|_{\boldsymbol{\theta}=\hat{\boldsymbol{\theta}}_{ML}}>0.$$
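A perturbation check of example 12.8's first-order solutions, on hypothetical data (everything in this block is an illustrative sketch, not part of the original derivation):

```python
import math

xs = [1.4, 2.2, 0.7, 3.1, 1.9, 2.5]          # hypothetical data
n = len(xs)

mu_hat = sum(xs) / n
sig2_hat = sum((x - mu_hat) ** 2 for x in xs) / n    # divisor n, not n - 1

def log_lik_normal(mu, sig2):
    """ln L(mu, sigma^2; x) up to the constant -(n/2)*ln(2*pi)."""
    rss = sum((x - mu) ** 2 for x in xs)
    return -0.5 * n * math.log(sig2) - rss / (2 * sig2)

best = log_lik_normal(mu_hat, sig2_hat)
for d in (0.05, 0.2):                        # perturbing either parameter lowers ln L
    assert best > log_lik_normal(mu_hat + d, sig2_hat)
    assert best > log_lik_normal(mu_hat - d, sig2_hat)
    assert best > log_lik_normal(mu_hat, sig2_hat + d)
    assert best > log_lik_normal(mu_hat, sig2_hat - d)
```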
The second derivatives and their expected values for the simple Normal model were derived in section 11.6 and yielded the following Fisher information matrix and C-R lower bounds for any unbiased estimators of $\mu$ and $\sigma^2$:
$$\mathcal{I}_n(\boldsymbol{\theta})=\begin{pmatrix}\frac{n}{\sigma^2}&0\\0&\frac{n}{2\sigma^4}\end{pmatrix},\quad\text{(a) C-R}(\mu)=\frac{\sigma^2}{n},\quad\text{(b) C-R}(\sigma^2)=\frac{2\sigma^4}{n}.$$
In addition, the sampling distributions of the MLEs take the form (section 11.6):
$$\text{(i)}\ \hat{\mu}_{ML}\sim\mathrm{N}\big(\mu,\tfrac{\sigma^2}{n}\big),\quad\text{(ii)}\ \frac{n\hat{\sigma}^2_{ML}}{\sigma^2}\sim\chi^2(n-1). \tag{2.4.12}$$
▸ $\hat{\mu}_{ML}$ is an unbiased, fully efficient, sufficient, consistent, asymptotically Normal, asymptotically efficient estimator of $\mu$.
▸ $\hat{\sigma}^2_{ML}$ is biased, sufficient, consistent, asymptotically Normal and asymptotically efficient.

**Example 12.9.** Consider the simple Gamma model (table 12.10), with a density function:
$$f(x;\boldsymbol{\theta})=\frac{1}{\Gamma[\alpha]}\,\frac{1}{\beta}\Big(\frac{x}{\beta}\Big)^{\alpha-1}\exp\Big\{-\Big(\frac{x}{\beta}\Big)\Big\},\ \boldsymbol{\theta}:=(\alpha,\beta)\in\mathbb{R}_+^2,\ x\in\mathbb{R}_+,$$
where $\Gamma[\alpha]$ is the Gamma function (see Appendix 3.A).

Table 12.10: The simple Gamma model

| | |
|---|---|
| Statistical GM: | $X_t=\alpha\beta+u_t$, $t\in\mathbb{N}:=(1,2,\ldots,n,\ldots)$ |
| [1] Gamma: | $X_t\sim\mathrm{G}(\alpha,\beta)$, $x_t\in\mathbb{R}_+$ |
| [2] Constant mean: | $E(X_t)=\alpha\beta$, $(\alpha,\beta)\in\mathbb{R}_+^2$, for all $t\in\mathbb{N}$ |
| [3] Constant variance: | $Var(X_t)=\alpha\beta^2$ for all $t\in\mathbb{N}$ |
| [4] Independence: | $\{X_t,\ t\in\mathbb{N}\}$ an independent process |

Assumptions [1]-[4] imply that $f(\mathbf{x};\boldsymbol{\theta})$, $\mathbf{x}\in\mathbb{R}_+^n$, takes the form:
$$f(\mathbf{x};\boldsymbol{\theta})=\prod_{t=1}^n\frac{1}{\Gamma[\alpha]\beta^{\alpha}}\,x_t^{\alpha-1}e^{-x_t/\beta}=\Big(\frac{1}{\Gamma[\alpha]\beta^{\alpha}}\Big)^n\Big(\prod_{t=1}^n x_t^{\alpha-1}\Big)\exp\Big\{-\frac{1}{\beta}\sum_{t=1}^n x_t\Big\},\ \mathbf{x}\in\mathbb{R}_+^n.$$
The log-likelihood function, with $\boldsymbol{\theta}:=(\alpha,\beta)$, takes the form:
$$\ln L(\boldsymbol{\theta};\mathbf{x})=\text{const}-n\ln\Gamma[\alpha]-n\alpha\ln\beta+(\alpha-1)\sum_{t=1}^n\ln x_t-\frac{1}{\beta}\sum_{t=1}^n x_t.$$
The first-order conditions yield:
$$\frac{\partial\ln L(\boldsymbol{\theta};\mathbf{x})}{\partial\beta}=-\frac{n\alpha}{\beta}+\frac{1}{\beta^2}\sum_{t=1}^n x_t=0,\quad\frac{\partial\ln L(\boldsymbol{\theta};\mathbf{x})}{\partial\alpha}=-n\psi[\alpha]-n\ln\beta+\sum_{t=1}^n\ln x_t=0,$$
where $\psi[\alpha]:=\frac{d}{d\alpha}\ln\Gamma[\alpha]$ is known as the digamma function (see Abramowitz and Stegun, 1970). Solving the first equation yields $\hat{\beta}_{ML}=\bar{X}_n/\alpha$, where $\bar{X}_n=\frac{1}{n}\sum_{t=1}^n X_t$. Substituting this into the second equation yields:
$$h(\alpha)=-n\psi[\alpha]-n\ln(\bar{X}_n/\alpha)+\sum_{t=1}^n\ln X_t=0, \tag{2.4.13}$$
which cannot be solved explicitly for $\hat{\alpha}_{ML}$. It can, however, be solved numerically.

#### 2.4.1 Numerical evaluation

In its simplest form, the numerical evaluation amounts to solving numerically the score equation $\frac{d\ln L(\theta;\mathbf{x})}{d\theta}=0$, this being a non-linear function of $\theta$. One of the simplest and most widely used algorithms is Newton-Raphson: find the value of $\theta$ in $\Theta$ that minimizes the function $q(\theta)=-\ln L(\theta;\mathbf{x})$
by ensuring that $g(\theta):=\frac{dq(\theta)}{d\theta}\simeq0$. Note that maximizing $\ln L(\theta;\mathbf{x})$ is equivalent to minimizing $q(\theta)$.

**Step 1.** Choose an initial (tentative) best guess 'value' $\theta_0$.

**Step 2.** The Newton-Raphson algorithm improves this value $\theta_0$ by choosing:
$$\theta_1=\theta_0-[g'(\theta_0)]^{-1}g(\theta_0),\ \text{where}\ g'(\theta_0)=\frac{dg(\theta)}{d\theta}\Big|_{\theta=\theta_0}.$$
This is based on taking a first-order Taylor approximation:
$$g(\theta_1)\simeq g(\theta_0)+(\theta_1-\theta_0)\,g'(\theta_0),$$
setting it equal to zero, $g(\theta_1)=0$, and solving it for $\theta_1$. This provides a quadratic approximation of the function $q(\theta)$.

**Step 3.** Continue iterating using the algorithm:
$$\hat{\theta}_{k+1}=\hat{\theta}_k-\big[g'(\hat{\theta}_k)\big]^{-1}g(\hat{\theta}_k),\ k=1,2,\ldots,K+1,$$
until the difference between $\hat{\theta}_{k+1}$ and $\hat{\theta}_k$ is less than a pre-assigned small value $\varepsilon$, say $\varepsilon=.00001$, i.e. $|\hat{\theta}_{k+1}-\hat{\theta}_k|<\varepsilon$.
Note that $[g'(\hat{\theta}_k)]$ is the observed information (matrix) encountered above.

**Step 4.** The MLE is chosen to be the value $\hat{\theta}_{K+1}$ for which $g(\hat{\theta}_{K+1})\simeq0$.

A related numerical algorithm, known as the **method of scoring**, replaces $g'(\hat{\theta}_k)$ with the Fisher information $\mathcal{I}_n(\theta)$, the justification being the convergence result $\frac{1}{n}g'(\hat{\theta}_k)\underset{n\to\infty}{\longrightarrow}\mathcal{I}(\theta)$, yielding the sequential iteration scheme:
$$\hat{\theta}_{k+1}=\hat{\theta}_k-\big[\mathcal{I}_n(\hat{\theta}_k)\big]^{-1}g(\hat{\theta}_k),\ k=1,2,\ldots,K+1.$$

**Example 12.10.** Consider the simple Logistic (one-parameter) model (table 12.11), with a density function:
$$f(x;\theta)=\frac{\exp(-(x-\theta))}{[1+\exp(-(x-\theta))]^2},\ x\in\mathbb{R},\ \theta\in\mathbb{R}.$$
Table 12.11: The simple (one-parameter) Logistic model

| | |
|---|---|
| Statistical GM: | $X_t=\theta+u_t$, $t\in\mathbb{N}:=(1,2,\ldots,n,\ldots)$ |
| [1] Logistic: | $X_t\sim\mathrm{Log}(\theta,\tfrac{\pi^2}{3})$, $x_t\in\mathbb{R}$ |
| [2] Constant mean: | $E(X_t)=\theta$, $\theta\in\mathbb{R}$, for all $t\in\mathbb{N}$ |
| [3] Constant variance: | $Var(X_t)=\tfrac{\pi^2}{3}$ for all $t\in\mathbb{N}$ |
| [4] Independence: | $\{X_t,\ t\in\mathbb{N}\}$ an independent process |

Assumptions [1]-[4] imply that $\ln L(\theta;\mathbf{x})$ and the first-order condition are:
$$\ln L(\theta;\mathbf{x})=-\sum_{t=1}^n(x_t-\theta)-2\sum_{t=1}^n\ln\big[1+\exp(-(x_t-\theta))\big],$$
$$\frac{d\ln L(\theta;\mathbf{x})}{d\theta}=n-2\sum_{t=1}^n\frac{\exp(-(x_t-\theta))}{1+\exp(-(x_t-\theta))}=0.$$
The MLE of $\theta$ can be derived using the Newton-Raphson algorithm with:
$$\frac{d^2\ln L(\theta;\mathbf{x})}{d\theta^2}=-2\sum_{t=1}^n\frac{\exp(-(x_t-\theta))}{[1+\exp(-(x_t-\theta))]^2}$$
and $\bar{X}_n$ as the initial value for $\theta$. For comparison purposes, note that:
$$\sqrt{n}(\bar{X}_n-\theta)\underset{n\to\infty}{\thicksim}\mathrm{N}\big(0,\tfrac{\pi^2}{3}\big),\ \tfrac{\pi^2}{3}=3.29,\quad\text{and}\quad\sqrt{n}(\hat{\theta}_{ML}-\theta)\underset{n\to\infty}{\thicksim}\mathrm{N}(0,3).$$
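The Newton-Raphson iteration of steps 1-4 can be applied directly to the score equation of example 12.10; a sketch on hypothetical data (the observations in `xs` are invented for illustration):

```python
import math

xs = [0.3, -1.2, 2.4, 0.9, -0.5, 1.6, 0.2]   # hypothetical data

def g(theta):
    """Score: g(theta) = n - 2 * sum u/(1 + u), with u = exp(-(x_t - theta))."""
    us = [math.exp(-(x - theta)) for x in xs]
    return len(xs) - 2 * sum(u / (1 + u) for u in us)

def g_prime(theta):
    """g'(theta) = -2 * sum u/(1 + u)^2 < 0 (minus the observed information)."""
    us = [math.exp(-(x - theta)) for x in xs]
    return -2 * sum(u / (1 + u) ** 2 for u in us)

theta = sum(xs) / len(xs)                     # step 1: initial value = sample mean
for _ in range(50):                           # steps 2-3: iterate to convergence
    step = g(theta) / g_prime(theta)
    theta -= step
    if abs(step) < 1e-10:
        break

assert abs(g(theta)) < 1e-8                   # step 4: the score vanishes at the MLE
```

The logistic log-likelihood is strictly concave, so starting from $\bar{x}_n$ the iteration converges in a handful of steps.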
### 2.5 Properties of Maximum Likelihood Estimators

#### 2.5.1 Finite sample properties

Maximum likelihood estimators are not unbiased in general; instead, they are invariant with respect to well-behaved functional reparameterizations, and the two properties are incompatible.

**(1) Parameterization invariance.** For $\varphi=h(\theta)$, a well-behaved (Borel) function of $\theta$, the MLE of $\varphi$ is given by:
$$\hat{\varphi}_{ML}=h(\hat{\theta}_{ML}).$$
This property is particularly useful because the substantive (structural) parameters of interest $\boldsymbol{\varphi}$ do not often coincide with the statistical parameters $\boldsymbol{\theta}$, and this property enables us to derive the MLEs of the former. In view of the fact that in general:
$$E(h(\hat{\theta}_{ML}))\ne h(E(\hat{\theta}_{ML})),$$
one can think of the bias in certain MLEs as the price to pay for the invariance property. That is, if $E(\hat{\theta}_{ML})=\theta$, $E(h(\hat{\theta}_{ML}))\ne h(\theta)$ in general.

**Example 12.11.** For the simple Normal model (table 12.9), $\hat{\mu}_{ML}$ is an unbiased estimator of $\mu$. Assuming that the parameter of interest is $\varphi=\mu^2$, is $\hat{\mu}^2_{ML}$ an unbiased
estimator? The answer is no, since:
$$E(\hat{\mu}^2_{ML})=E\Big(\frac{1}{n}\sum_{t=1}^n X_t\Big)^2=\frac{1}{n^2}E\Big[\sum_{t=1}^n X_t^2+\underset{i\ne j}{\sum\sum}X_iX_j\Big]\overset{\text{IID}}{=}\frac{1}{n^2}\big[n(\sigma^2+\mu^2)+n(n-1)\mu^2\big]=\mu^2+\frac{\sigma^2}{n}\ne\mu^2,$$
since $E(X_t^2)=\sigma^2+\mu^2$ and $E(X_iX_j)\overset{\text{Ind}}{=}E(X_i)\cdot E(X_j)=\mu^2$.

**(2) Unbiasedness & Full efficiency.** When $\tilde{\theta}$ is unbiased and attains C-R$(\theta)$, then $\tilde{\theta}=\hat{\theta}_{ML}$.

**Example 12.14.** Consider the simple Poisson model in table 12.14:

Table 12.14: The simple Poisson model

| | |
|---|---|
| Statistical GM: | $X_t=\lambda+u_t$, $t\in\mathbb{N}:=(1,2,\ldots,n,\ldots)$ |
| [1] Poisson: | $X_t\sim\mathrm{Poisson}(\lambda)$, $x_t\in\mathbb{N}_0$ |
| [2] Constant mean: | $E(X_t)=\lambda$, $\lambda>0$, for all $t\in\mathbb{N}$ |
| [3] Constant variance: | $Var(X_t)=\lambda$ for all $t\in\mathbb{N}$ |
| [4] Independence: | $\{X_t,\ t\in\mathbb{N}\}$ an independent process |
$$f(x;\lambda)=\frac{\lambda^{x}e^{-\lambda}}{x!},\ \lambda>0,\ x\in\mathbb{N}_0=\{0,1,2,\ldots\}.$$
Given that $\lambda=E(X_t)$, an obvious unbiased estimator of $\lambda$ is $\tilde{\lambda}=\frac{1}{n}\sum_{t=1}^n X_t$, since $E(\tilde{\lambda})=\lambda$ and $Var(\tilde{\lambda})=\frac{\lambda}{n}$. Is $\tilde{\lambda}$ also fully efficient? Assumptions [1]-[4] imply that $f(\mathbf{x};\lambda)$ and $\ln L(\lambda;\mathbf{x})$ are:
$$f(\mathbf{x};\lambda)=\prod_{t=1}^n\frac{\lambda^{x_t}e^{-\lambda}}{x_t!}=\lambda^{\sum_{t=1}^n x_t}e^{-n\lambda}\prod_{t=1}^n\frac{1}{x_t!},\quad\ln L(\lambda;\mathbf{x})=\Big(\sum_{t=1}^n x_t\Big)\ln\lambda-n\lambda-\sum_{t=1}^n\ln(x_t!),$$
$$\frac{d\ln L(\lambda;\mathbf{x})}{d\lambda}=-n+\frac{1}{\lambda}\sum_{t=1}^n x_t,\quad\frac{d^2\ln L(\lambda;\mathbf{x})}{d\lambda^2}=-\frac{1}{\lambda^2}\sum_{t=1}^n x_t,$$
$$\mathcal{I}_n(\lambda)=-E\Big(\frac{d^2\ln L(\lambda;\mathbf{x})}{d\lambda^2}\Big)=\frac{1}{\lambda^2}\sum_{t=1}^n E(X_t)=\frac{n\lambda}{\lambda^2}=\frac{n}{\lambda},$$
which yields C-R$(\lambda)=\frac{\lambda}{n}$, i.e. $\tilde{\lambda}$ is fully efficient, and thus it coincides with the ML estimator since:
$$\frac{d\ln L(\lambda;\mathbf{x})}{d\lambda}=-n+\frac{1}{\lambda}\sum_{t=1}^n x_t=0\ \Rightarrow\ \hat{\lambda}_{ML}=\frac{1}{n}\sum_{t=1}^n X_t=\tilde{\lambda}.$$
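For example 12.14 the score has a closed form, so the coincidence $\hat{\lambda}_{ML}=\tilde{\lambda}=\bar{y}_n$ can be checked directly on a hypothetical count sample:

```python
ys = [3, 0, 2, 5, 1, 2, 4, 2]                 # hypothetical count data
n = len(ys)
lam_hat = sum(ys) / n                          # MLE = sample mean

def score(lam):
    """d ln L / d lambda = -n + (1/lambda) * sum(y_t)."""
    return -n + sum(ys) / lam

assert abs(score(lam_hat)) < 1e-12             # the score is zero exactly at y-bar
obs_info = sum(ys) / lam_hat ** 2              # -d^2 ln L / d lambda^2 > 0: a maximum
assert obs_info > 0
```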
**(3) Sufficiency.** The notion of a sufficient statistic is operationalized using the Factorization theorem. A statistic $h(\mathbf{X})$ is said to be a sufficient statistic for $\theta$ if and only if there exist functions $g(h(\mathbf{x});\theta)$ and $v(\mathbf{x})$ such that:
$$f(\mathbf{x};\theta)=g(h(\mathbf{x});\theta)\cdot v(\mathbf{x}),\ \mathbf{x}\in\mathbb{R}_X^n. \tag{2.5.14}$$
The result in (2.5.14) suggests that if there exists a sufficient statistic $h(\mathbf{X})$ and the MLE $\hat{\theta}_{ML}(\mathbf{X})$ exists and is unique, then $\hat{\theta}_{ML}(\mathbf{X})=q(h(\mathbf{X}))$, because:
$$L(\theta;\mathbf{x}_0)=[c(\mathbf{x}_0)\cdot v(\mathbf{x}_0)]\cdot g(h(\mathbf{x}_0);\theta)\propto g(h(\mathbf{x}_0);\theta),\ \theta\in\Theta, \tag{2.5.15}$$
$$\frac{d\ln L(\theta;\mathbf{x})}{d\theta}=\frac{d\ln g(h(\mathbf{x});\theta)}{d\theta}\ \Rightarrow\ \hat{\theta}_{ML}=q(h(\mathbf{X})),$$
ensuring that $\hat{\theta}_{ML}=q(h(\mathbf{X}))$ depends on $\mathbf{X}$ only through the sufficient statistic.

**(4) Full Efficiency.** Recalling from chapter 11 that an estimator $\tilde{\theta}(\mathbf{X})$ is fully efficient iff:
$$(\tilde{\theta}(\mathbf{X})-\theta)=h(\theta)\Big[\frac{d\ln f(\mathbf{x};\theta)}{d\theta}\Big] \tag{2.5.16}$$
for some function $h(\theta)$, this implies that $L(\theta;\mathbf{x}_0)$ has the form in (2.5.15), and thus if a fully efficient estimator $\tilde{\theta}(\mathbf{X})$ exists, $\tilde{\theta}(\mathbf{X})=\hat{\theta}_{ML}(\mathbf{X})$. This suggests that the existence of a sufficient statistic is a weaker requirement than that of a fully efficient estimator.

#### 2.5.2 Asymptotic properties (IID sample)

Let us consider the asymptotic properties of MLEs in the simple IID sample case, where:
$$\mathcal{I}_n(\theta)=n\,\mathcal{I}(\theta),\quad\mathcal{I}(\theta)=E\Big(\frac{d\ln f(X_t;\theta)}{d\theta}\Big)^2>0, \tag{2.5.17}$$
where $\mathcal{I}(\theta)$ is known as Fisher's information for one observation. In addition to R1-R6, we will need the conditions in table 12.15.

Table 12.15: Regularity conditions for $\ln L(\theta;\mathbf{x})$, $\theta\in\Theta$
(R7) $E(\ln f(X_t;\theta))$ exists,
(R8) $\frac{1}{n}\ln L(\theta;\mathbf{x})\underset{n\to\infty}{\longrightarrow}E(\ln f(X_t;\theta))$ for all $\theta\in\Theta$,
(R9) $\ln L(\theta;\mathbf{x})$ is twice differentiable in an open interval around $\theta^*$.
**(5) Consistency**

(a) Weak consistency. Under these regularity conditions, MLEs are weakly consistent, i.e. for any $\varepsilon>0$:
$$\lim_{n\to\infty}\mathbb{P}\big(|\hat{\theta}_{ML}-\theta^*|\le\varepsilon\big)=1,\ \text{denoted by}\ \hat{\theta}_{ML}\overset{\mathbb{P}}{\to}\theta^*.$$
(b) Strong consistency. Under these regularity conditions, MLEs are strongly consistent:
$$\mathbb{P}\big(\lim_{n\to\infty}\hat{\theta}_{ML}=\theta^*\big)=1,\ \text{denoted by}\ \hat{\theta}_{ML}\overset{\text{a.s.}}{\to}\theta^*.$$
See chapter 9 for a discussion of the difference between these two modes of convergence.

**(6) Asymptotic Normality.** Under the regularity conditions (R1)-(R9), MLEs are asymptotically Normal:
$$\sqrt{n}(\hat{\theta}_{ML}-\theta)\underset{n\to\infty}{\thicksim}\mathrm{N}(0,V_{\infty}(\theta)), \tag{2.5.18}$$
where $V_{\infty}(\theta)$ denotes the asymptotic variance of $\hat{\theta}_{ML}$.
**(7) Asymptotic Unbiasedness.** The asymptotic Normality of MLEs also implies asymptotic unbiasedness: $\lim_{n\to\infty}E(\hat{\theta}_{ML})=\theta$.

**(8) Asymptotic (full) Efficiency.** Under the same regularity conditions, the asymptotic variance of maximum likelihood estimators achieves the asymptotic Cramér-Rao lower bound, which in view of (2.5.17) is:
$$V_{\infty}(\hat{\theta}_{ML})=\mathcal{I}^{-1}(\theta).$$

**Example 12.15.** For the simple Bernoulli model (table 12.4):
$$\sqrt{n}(\hat{\theta}_{ML}-\theta)\underset{n\to\infty}{\thicksim}\mathrm{N}(0,\theta(1-\theta)).$$

**Example 12.16.** For the simple Exponential model (table 12.8):
$$\sqrt{n}(\hat{\theta}_{ML}-\theta)\underset{n\to\infty}{\thicksim}\mathrm{N}(0,\theta^2).$$
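Consistency (property 5) can be made concrete for the Bernoulli case of example 12.15 via Chebyshev's inequality plus a seeded simulation (the value $\theta=0.3$ and the seed are arbitrary choices of ours):

```python
import random

theta = 0.3
# Chebyshev: P(|theta_hat - theta| > eps) <= Var(theta_hat)/eps^2 = theta*(1-theta)/(n*eps^2)
eps = 0.05
bound = lambda n: theta * (1 - theta) / (n * eps ** 2)
assert bound(100) > bound(10_000) > bound(1_000_000)   # the bound shrinks with n
assert bound(1_000_000) < 1e-3

random.seed(7)                                 # fixed seed: reproducible illustration
n = 10_000
theta_hat = sum(1 if random.random() < theta else 0 for _ in range(n)) / n
assert abs(theta_hat - theta) < 0.05           # roughly 11 standard errors: a very safe margin
```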
**Example 12.17.** For the simple Logistic model (table 12.11):
$$\sqrt{n}(\hat{\theta}_{ML}-\theta)\underset{n\to\infty}{\thicksim}\mathrm{N}(0,3).$$

**Example 12.18.** For the simple Normal model (table 12.9):
$$\sqrt{n}(\hat{\mu}_{ML}-\mu)\underset{n\to\infty}{\thicksim}\mathrm{N}(0,\sigma^2),\quad\sqrt{n}(\hat{\sigma}^2_{ML}-\sigma^2)\underset{n\to\infty}{\thicksim}\mathrm{N}(0,2\sigma^4).$$

#### 2.5.3 Asymptotic properties (Independent (I) but non-ID sample)

For an independent but non-ID sample:
$$\mathcal{I}_n(\theta)\ne n\,\mathcal{I}(\theta),\quad\mathcal{I}_n(\theta)=\sum_{t=1}^n\mathcal{I}_t(\theta),\quad\mathcal{I}_t(\theta)=E\Big[\frac{d\ln f_t(x_t;\theta)}{d\theta}\Big]^2. \tag{2.5.19}$$

Table 12.16: Regularity conditions for $\mathcal{I}_n(\theta)$
(a) $\lim_{n\to\infty}\mathcal{I}_n(\theta)=\infty$ $\Rightarrow$ $\hat{\theta}_{ML}$ is consistent,
(b) there exists a sequence $\{c_n\}_{n=1}^{\infty}$ such that $\lim_{n\to\infty}\big(\frac{1}{c_n^2}\mathcal{I}_n(\theta)\big)=\mathcal{I}_{\infty}(\theta)>0$ $\Rightarrow$ Asymptotic Normality.
Asymptotic Normality under these conditions takes the form:
$$c_n(\hat{\theta}_{ML}-\theta)\underset{n\to\infty}{\thicksim}\mathrm{N}\big(0,\mathcal{I}_{\infty}^{-1}(\theta)\big).$$

**Example 12.19.** Consider a Poisson model with separable heterogeneity:
$$X_t\sim\mathrm{PI}(t\lambda),\quad f_t(x_t;\lambda)=\frac{e^{-t\lambda}(t\lambda)^{x_t}}{x_t!},\quad E(X_t)=t\lambda,\ Var(X_t)=t\lambda,\ t\in\mathbb{N},\ \lambda>0,\ x_t\in\mathbb{N}_0=\{0,1,2,\ldots\}.$$
$$f(\mathbf{x};\lambda)=\prod_{t=1}^n\frac{e^{-t\lambda}(t\lambda)^{x_t}}{x_t!}\ \Rightarrow\ \ln L(\lambda;\mathbf{x})=\text{const}+\Big(\sum_{t=1}^n x_t\Big)\ln\lambda-\lambda s_n,\ \text{where}\ s_n=\sum_{t=1}^n t=\tfrac{n(n+1)}{2},$$
$$\frac{d\ln L(\lambda;\mathbf{x})}{d\lambda}=\frac{1}{\lambda}\sum_{t=1}^n x_t-s_n=0\ \Rightarrow\ \hat{\lambda}_{ML}=\frac{1}{s_n}\sum_{t=1}^n X_t,\quad E(\hat{\lambda}_{ML})=\lambda,\ Var(\hat{\lambda}_{ML})=\frac{\lambda}{s_n}.$$
The question is whether, in addition to being unbiased, $\hat{\lambda}_{ML}$ is fully efficient:
$$\frac{d^2\ln L(\lambda;\mathbf{x})}{d\lambda^2}=-\frac{1}{\lambda^2}\sum_{t=1}^n x_t,\quad\mathcal{I}_n(\lambda)=-E\Big(\frac{d^2\ln L(\lambda;\mathbf{x})}{d\lambda^2}\Big)=\frac{1}{\lambda^2}\sum_{t=1}^n t\lambda=\frac{s_n}{\lambda}.$$
Hence C-R$(\lambda)=\frac{\lambda}{s_n}=Var(\hat{\lambda}_{ML})$, and thus $\hat{\lambda}_{ML}$ is fully efficient. In terms of asymptotic properties, $\hat{\lambda}_{ML}$ is clearly consistent since $Var(\hat{\lambda}_{ML})\underset{n\to\infty}{\longrightarrow}0$.
The asymptotic Normality is less obvious, but since $\frac{1}{c_n^2}\mathcal{I}_n(\lambda)\underset{n\to\infty}{\longrightarrow}\frac{1}{\lambda}$ for $c_n^2=s_n=\frac{n(n+1)}{2}$, the scaling sequence $\{c_n=\sqrt{s_n}\}_{n=1}^{\infty}$ yields:
$$c_n(\hat{\lambda}_{ML}-\lambda)\underset{n\to\infty}{\thicksim}\mathrm{N}(0,\lambda).$$
This, however, is not a satisfactory result because the variance involves the unknown $\lambda$. A more general result that is often preferable is to use $\{\sqrt{\mathcal{I}_n(\lambda)}\}_{n=1}^{\infty}$ as the scaling sequence:
$$\sqrt{\mathcal{I}_n(\lambda)}\,(\hat{\lambda}_{ML}-\lambda)=\frac{Y_n-\lambda s_n}{\sqrt{\lambda s_n}}\underset{n\to\infty}{\thicksim}\mathrm{N}(0,1),\quad Y_n=\sum_{t=1}^n X_t\sim\mathrm{P}(\lambda s_n).$$

**Example 12.20.** Consider an NI (Normal, Independent) model with separable heterogeneity:
$$X_t\sim\mathrm{NI}(t\mu,1),\quad f_t(x_t;\mu)=\frac{1}{\sqrt{2\pi}}\exp\Big(-\frac{(x_t-t\mu)^2}{2}\Big),\ t\in\mathbb{N},\ \mu\in\mathbb{R},\ x_t\in\mathbb{R}.$$
The distribution of the sample is:
$$f(\mathbf{x};\mu)=\prod_{t=1}^n\frac{1}{\sqrt{2\pi}}\exp\Big(-\frac{(x_t-t\mu)^2}{2}\Big)=\Big(\frac{1}{\sqrt{2\pi}}\Big)^n\exp\Big(-\frac{1}{2}\sum_{t=1}^n x_t^2\Big)\exp\Big(\mu\sum_{t=1}^n t\,x_t-\frac{\mu^2}{2}\sum_{t=1}^n t^2\Big),$$
since $(x_t-t\mu)^2=x_t^2+t^2\mu^2-2t\mu x_t$; let $q_n=\sum_{t=1}^n t^2=\frac{n(n+1)(2n+1)}{6}$, and thus $\ln L(\mu;\mathbf{x})$ is:
$$\ln L(\mu;\mathbf{x})=\text{const.}+\mu\sum_{t=1}^n t\,x_t-\frac{\mu^2}{2}q_n,\quad\frac{d\ln L(\mu;\mathbf{x})}{d\mu}=\sum_{t=1}^n t\,x_t-\mu q_n=0\ \Rightarrow\ \hat{\mu}_{ML}=\frac{1}{q_n}\sum_{t=1}^n t\,X_t,$$
$$E(\hat{\mu}_{ML})=\mu,\quad Var(\hat{\mu}_{ML})=\frac{1}{q_n^2}\sum_{t=1}^n t^2\,Var(X_t)=\frac{1}{q_n},\quad\mathcal{I}_n(\mu)=-E\Big(\frac{d^2\ln L(\mu;\mathbf{x})}{d\mu^2}\Big)=q_n,$$
so that C-R$(\mu)=\frac{1}{q_n}=Var(\hat{\mu}_{ML})$. These results imply that $\hat{\mu}_{ML}$ is unbiased, fully efficient and consistent. In addition, since $\frac{1}{q_n}\mathcal{I}_n(\mu)=1$:
$$\sqrt{q_n}\,(\hat{\mu}_{ML}-\mu)\underset{n\to\infty}{\thicksim}\mathrm{N}(0,1).$$

**Summary of optimal properties of MLEs.** The Maximum Likelihood method yields estimators which, under certain regularity conditions, are consistent, asymptotically Normal, asymptotically unbiased and asymptotically efficient. In addition, they satisfy excellent finite sample properties, such as reparameterization invariance and sufficiency, as well as unbiasedness-full efficiency when the latter two hold simultaneously.
### 2.6 The Maximum Likelihood method and its critics

MLEs are claimed to be inappropriate when:

(a) the sample size $n$ is inappropriately small (confused!). Answer: if $n$ is too small to test the model assumptions, it is too small for inference purposes!

(b) the regularity conditions do not hold (irrelevant). Answer: to get general results one needs to demarcate the statistical models for which such results hold!

(c) the postulated statistical model is problematic (irregular).

**Example 12.21**: the Neyman and Scott (1948) model. The statistical GM for this N-S model takes the form:
$$\mathbf{X}_k=\boldsymbol{\mu}_k+\boldsymbol{\varepsilon}_k,\ k=1,2,\ldots,n,$$
where the underlying distribution is Normal, of the form:
$$\mathbf{X}_k:=\begin{pmatrix}X_{1k}\\X_{2k}\end{pmatrix}\sim\mathrm{NI}\Big(\begin{pmatrix}\mu_k\\\mu_k\end{pmatrix},\begin{pmatrix}\sigma^2&0\\0&\sigma^2\end{pmatrix}\Big),\ k=1,2,\ldots,n. \tag{2.6.20}$$
Note that this model is not well-defined, since it has an incidental parameter
problem: the unknown parameters $(\mu_1,\mu_2,\ldots,\mu_n)$ increase with the sample size $n$. Neyman and Scott attempted to sidestep this problem by declaring $\sigma^2$ the only parameter of interest and designating $(\mu_1,\mu_2,\ldots,\mu_n)$ as nuisance parameters, which does not deal with the problem.

Let us ignore the incidental parameter problem and proceed to derive the distribution of the sample and the log-likelihood function:
$$f(\mathbf{x};\boldsymbol{\theta})=\prod_{k=1}^n\prod_{i=1}^2\frac{1}{\sqrt{2\pi\sigma^2}}\exp\Big\{-\frac{(x_{ik}-\mu_k)^2}{2\sigma^2}\Big\}=\prod_{k=1}^n\frac{1}{2\pi\sigma^2}\exp\Big\{-\frac{1}{2\sigma^2}\big[(x_{1k}-\mu_k)^2+(x_{2k}-\mu_k)^2\big]\Big\},$$
$$\ln L(\boldsymbol{\theta};\mathbf{x})=-n\ln(2\pi\sigma^2)-\frac{1}{2\sigma^2}\sum_{k=1}^n\big[(x_{1k}-\mu_k)^2+(x_{2k}-\mu_k)^2\big]. \tag{2.6.21}$$
In light of (2.6.21), the "MLEs" are then derived by solving the first-order conditions:
$$\frac{\partial\ln L(\boldsymbol{\theta};\mathbf{x})}{\partial\mu_k}=\frac{1}{\sigma^2}\big[(x_{1k}-\mu_k)+(x_{2k}-\mu_k)\big]=0\ \Rightarrow\ \hat{\mu}_k=\tfrac{1}{2}(x_{1k}+x_{2k}),\ k=1,\ldots,n,$$
$$\frac{\partial\ln L(\boldsymbol{\theta};\mathbf{x})}{\partial\sigma^2}=-\frac{n}{\sigma^2}+\frac{1}{2\sigma^4}\sum_{k=1}^n\big[(x_{1k}-\mu_k)^2+(x_{2k}-\mu_k)^2\big]=0\ \Rightarrow$$
$$\hat{\sigma}^2=\frac{1}{2n}\sum_{k=1}^n\big[(x_{1k}-\hat{\mu}_k)^2+(x_{2k}-\hat{\mu}_k)^2\big]=\frac{1}{n}\sum_{k=1}^n\frac{(x_{1k}-x_{2k})^2}{4}. \tag{2.6.22}$$
Critics of the ML method claim that ML yields inconsistent estimators, since:
$$E(\hat{\mu}_k)=\mu_k,\quad Var(\hat{\mu}_k)=\frac{\sigma^2}{2}\underset{n\to\infty}{\nrightarrow}0,\quad E(\hat{\sigma}^2)=\frac{\sigma^2}{2},\quad\hat{\sigma}^2\overset{\mathbb{P}}{\to}\frac{\sigma^2}{2}\ne\sigma^2.$$
This, however, is a misplaced criticism, since by definition $\sigma^2=E(X_{ik}-\mu_k)^2$, and thus any attempt to find a consistent estimator of $\sigma^2$ calls for a consistent estimator of $\mu_k$, but $\hat{\mu}_k=\frac{1}{2}(X_{1k}+X_{2k})$ is inconsistent. In light of that, the real question is not why the ML method does not yield a consistent estimator of $\sigma^2$, but, given that (2.6.20) is ill-specified:

▶ Why would the ML method yield a consistent estimator of $\sigma^2$?

Indeed, the fact that the ML method does not yield consistent estimators in such cases is an argument in its favor, not against it! The source of the problem is not the ML method but the statistical model in (2.6.20), which suffers from the incidental parameter problem. This problem can be addressed by respecifying (2.6.20) using the transformation:
$$Y_k=\frac{1}{\sqrt{2}}(X_{1k}-X_{2k})\sim\mathrm{NIID}(0,\sigma^2),\ k=1,2,\ldots,n. \tag{2.6.23}$$
Now the MLE for $\sigma^2$ is $\hat{\sigma}^2_{ML}=\frac{1}{n}\sum_{k=1}^n Y_k^2$, which is unbiased, fully efficient and strongly consistent:
$$E(\hat{\sigma}^2_{ML})=\sigma^2,\quad Var(\hat{\sigma}^2_{ML})=\frac{2\sigma^4}{n},\quad\hat{\sigma}^2_{ML}\overset{\text{a.s.}}{\to}\sigma^2.$$
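A seeded simulation contrasting (2.6.22) with the respecified estimator from (2.6.23); a sketch with arbitrary choices of ours ($\sigma^2=1$, incidental means cycling over a few values, a fixed seed):

```python
import random

random.seed(42)
sigma2 = 1.0
n = 20_000                                     # number of (x_1k, x_2k) pairs
naive = 0.0
respecified = 0.0
for k in range(n):
    mu_k = float(k % 7)                        # incidental means, one per pair
    x1 = random.gauss(mu_k, sigma2 ** 0.5)
    x2 = random.gauss(mu_k, sigma2 ** 0.5)
    naive += (x1 - x2) ** 2 / 4                # sigma^2-hat from (2.6.22)
    respecified += ((x1 - x2) / 2 ** 0.5) ** 2 # (1/n) * sum Y_k^2 from (2.6.23)
naive /= n
respecified /= n

assert abs(naive - sigma2 / 2) < 0.08          # settles near sigma^2/2: inconsistent
assert abs(respecified - sigma2) < 0.08        # settles near sigma^2: consistent
```

The simulation makes the point of the example visible: the "MLE" from the ill-specified model stabilizes at half the true $\sigma^2$ no matter how large $n$ gets.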
The criticism in (c) relates to ill-specified models suffering from the incidental parameter problem, or models on which contrived constraints, giving rise to unnatural reparameterizations, are imposed on the parameters at the outset; see Spanos (2010b; 2011a; 2012b; 2013a-d).

## 3 The Least-Squares method

### 3.1 The mathematical principle of least-squares

The principle of least-squares was originally proposed as a mathematical approximation procedure by Legendre in 1805. It concerns approximating an unknown function $h(\cdot):\mathbb{R}\to\mathbb{R}$:
$$y=h(x),\ (x,y)\in(\mathbb{R}\times\mathbb{R}),$$
by selecting an approximating function, say linear:
$$g(x)=\alpha_0+\alpha_1 x,\ (x,y)\in(\mathbb{R}\times\mathbb{R}),$$
and fitting $g(x)$ using data $\mathbf{z}_0:=\{(x_k,y_k),\ k=1,2,\ldots,n\}$. This curve-fitting problem involves the approximation error $\varepsilon(x)=h(x)-g(x)$, giving rise to the problem of how to use the data $\mathbf{z}_0$ to get the best approximation by fitting:
$$y_k=\alpha_0+\alpha_1 x_k+\varepsilon_k,\ k=1,2,\ldots,n. \tag{3.1.24}$$
The earliest attempt to address this problem was made by Boscovitch in 1757 by
proposing (Hald, 1998, 2007) the criterion:
$$\min_{\alpha_0,\alpha_1}\sum_{k=1}^n|\varepsilon_k|\ \text{subject to}\ \sum_{k=1}^n\varepsilon_k=0, \tag{3.1.25}$$
using a purely geometric argument about its merits. In 1789 Laplace proposed an analytic solution to the minimization problem in (3.1.25) that was rather laborious to implement. In 1805 Legendre offered a less laborious solution to the approximation problem by replacing $\sum_{k=1}^n|\varepsilon_k|$ with $\sum_{k=1}^n\varepsilon_k^2$, giving rise to the much easier minimization of the sum of squares (least-squares) of the errors: $\min_{\alpha_0,\alpha_1}\sum_{k=1}^n\varepsilon_k^2$.

In the case of (3.1.24), the principle of least squares amounts to minimizing:
$$S(\alpha_0,\alpha_1)=\sum_{k=1}^n(y_k-\alpha_0-\alpha_1 x_k)^2. \tag{3.1.26}$$
The first-order conditions for a minimum, called the normal equations, are:
$$\text{(i)}\ \frac{\partial S}{\partial\alpha_0}=(-2)\sum_{k=1}^n(y_k-\alpha_0-\alpha_1 x_k)=0,\quad\text{(ii)}\ \frac{\partial S}{\partial\alpha_1}=(-2)\sum_{k=1}^n(y_k-\alpha_0-\alpha_1 x_k)x_k=0.$$
Solving these two equations for $(\alpha_0,\alpha_1)$ yields the Least-Squares estimates:
$\hat{\alpha}_0 = \bar{y} - \hat{\alpha}_1 \bar{x}, \quad \hat{\alpha}_1 = \frac{\sum_{t=1}^n (x_t - \bar{x})(y_t - \bar{y})}{\sum_{t=1}^n (x_t - \bar{x})^2}.$   (3.1.27)
Example 12.22. The fitted line $\hat{y}_t = \hat{\alpha}_0 + \hat{\alpha}_1 x_t$ through the scatter-plot of data ($n = 200$) in figure 12.1 is:
$\hat{y}_t = 1.105 + 0.809\, x_t.$   (3.1.28)
In addition to (3.1.28), one could construct goodness-of-fit measures:
$s^2 = \frac{1}{n-2}\sum_{t=1}^n \hat{\varepsilon}_t^2 = 0.224, \quad R^2 = 1 - \frac{\sum_{t=1}^n \hat{\varepsilon}_t^2}{\sum_{t=1}^n (y_t - \bar{y})^2} = 0.778.$   (3.1.29)
As it stands, however, (3.1.28)-(3.1.29) provides no basis for inductive inference.
[Fig. 12.1: Least-squares line fitting]
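The closed-form estimates (3.1.27) and the goodness-of-fit measures (3.1.29) are straightforward to compute directly; a minimal sketch, where the function name and the simulated data are illustrative (not the actual data behind figure 12.1):

```python
import numpy as np

def least_squares_line(x, y):
    """Least-squares estimates (3.1.27) plus s^2 and R^2 as in (3.1.29)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    a1 = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
    a0 = y.mean() - a1 * x.mean()
    resid = y - a0 - a1 * x                       # fitted residuals
    s2 = (resid ** 2).sum() / (n - 2)             # s^2 with n-2 degrees of freedom
    r2 = 1.0 - (resid ** 2).sum() / ((y - y.mean()) ** 2).sum()
    return a0, a1, s2, r2

# Illustrative data resembling the setup of example 12.22
rng = np.random.default_rng(1)
x = rng.uniform(0, 5, 200)
y = 1.105 + 0.809 * x + rng.normal(0, 0.5, 200)
a0, a1, s2, r2 = least_squares_line(x, y)
print(a0, a1, s2, r2)
```

The estimates agree with any standard least-squares routine (e.g. `numpy.polyfit` with degree 1), since both solve the same normal equations.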
The above mathematical approximation perspective on curve-fitting does not have any probabilistic premises stating the conditions under which the statistics $(\hat{\alpha}_0, \hat{\alpha}_1, s^2, R^2)$ are inferentially meaningful and reliable, as opposed to mathematically meaningful.
3.2 Least squares as a statistical method
It is interesting to note that Legendre's initial justification for the least-squares method was that for the simplest case, where $h(x) = \mu$, $(x, y) \in \mathbb{R} \times \mathbb{R}$:
$y_t = \mu + \varepsilon_t, \quad t = 1, 2, \ldots, n,$   (3.2.30)
minimizing the sum of squares $\sum_{t=1}^n \varepsilon_t^2$ yields:
$S(\mu) = \sum_{t=1}^n (y_t - \mu)^2, \quad \frac{dS(\mu)}{d\mu} = (-2)\sum_{t=1}^n (y_t - \mu) = 0,$
giving rise to the arithmetic mean: $\hat{\mu} = \frac{1}{n}\sum_{t=1}^n y_t$.
At that time, the arithmetic mean was considered to be the gold standard for summarizing the information contained in the data points $y_1, y_2, \ldots, y_n$, ignoring the fact that this presumes that $(y_1, \ldots, y_n)$ are IID. The first probabilistic framing for least-squares was given by Gauss (1809).
He introduced the Normal distribution by arguing that, for a sequence of independent random variables $Y_1, Y_2, \ldots, Y_n$ whose density function $f(y)$ satisfies certain regularity conditions, if $\bar{y}$ is the most probable combination for all values of $Y_1, Y_2, \ldots, Y_n$ and each $n \geq 1$, then $f(y)$ is Normal; see Heyde and Seneta (1977), p. 63. This provided the missing probabilistic premises, and Gauss (1821) went on to prove an important result.
Gauss-Markov theorem. Gauss supplemented the statistical GM (3.2.30) with the probabilistic assumptions:
(i) $E(\varepsilon_t) = 0$, (ii) $E(\varepsilon_t^2) = \sigma^2 > 0$, (iii) $E(\varepsilon_t \varepsilon_s) = 0$, $t \neq s$, $t, s = 1, 2, \ldots, n,$
and proved that under assumptions (i)-(iii) the least-squares estimator $\hat{\mu}_{LS} = \frac{1}{n}\sum_{t=1}^n y_t$ is Best (smallest variance) within the class of Linear and Unbiased Estimators (BLUE).
Proof. Any linear estimator of $\mu$ will be of the form $\tilde{\mu}(w) = \sum_{t=1}^n w_t y_t$, where $w := (w_1, w_2, \ldots, w_n)$ denotes constant weights. For $\tilde{\mu}(w)$ to be unbiased it must be the case that $\sum_{t=1}^n w_t = 1$, since $E(\tilde{\mu}(w)) = \sum_{t=1}^n w_t E(y_t) = \mu \sum_{t=1}^n w_t$. This implies that the problem of minimizing $Var(\tilde{\mu}(w)) = \sigma^2 \sum_{t=1}^n w_t^2$ can be transformed into a
Lagrange multiplier problem:
$\min_w L(w) = \sigma^2\left(\sum_{t=1}^n w_t^2\right) - 2\lambda\left(\sum_{t=1}^n w_t - 1\right),$
whose first order conditions for a minimum yield:
$\frac{\partial L(w)}{\partial w_t} = 2\sigma^2 w_t - 2\lambda = 0 \ \Rightarrow \ w_t = \frac{\lambda}{\sigma^2},$
$\frac{\partial L(w)}{\partial \lambda} = -2\left(\sum_{t=1}^n w_t - 1\right) = 0 \ \Rightarrow \ \sum_{t=1}^n w_t = 1 \ \Rightarrow \ w_t = \frac{1}{n}, \quad t = 1, 2, \ldots, n.$
This proves that $\hat{\mu} = \frac{1}{n}\sum_{t=1}^n y_t$ is the BLUE of $\mu$. The Gauss-Markov theorem is of very limited value in 'learning from data' because a BLUE estimator provides a very poor basis for inference, since the sampling distributions of $\hat{\mu}_{LS}$ and $s^2 = \frac{1}{n-1}\sum_{t=1}^n (y_t - \hat{\mu}_{LS})^2$ are unknown:
$\hat{\mu}_{LS} \sim^{?} D_1\!\left(\mu, \tfrac{\sigma^2}{n}\right), \quad s^2 \sim^{?} D_2\!\left(\sigma^2,\ \tfrac{\mu_4 - \mu_2^2}{n} - \tfrac{2(\mu_4 - 2\mu_2^2)}{n^2} + \tfrac{\mu_4 - 3\mu_2^2}{n^3}\right), \quad Cov(\hat{\mu}_{LS}, s^2) = \tfrac{\mu_3}{n},$
where $\mu_r$, $r = 2, 3, 4$, denote the central moments of the error distribution. In addition, the class of linear and unbiased estimators is unnecessarily narrow and excludes from consideration non-linear functions of $\mathbf{Y}$.
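Under assumptions (i)-(iii), the variance of any linear unbiased estimator is $\sigma^2 \sum_{t=1}^n w_t^2$, so the BLUE property can be checked numerically by comparing the equal-weights solution against any other weights summing to one; a small sketch, where the particular alternative weights are an arbitrary illustration:

```python
import numpy as np

sigma2, n = 1.0, 10              # illustrative values

def var_linear(w):
    """Var(sum w_t * y_t) under (i)-(iii): sigma^2 times sum of squared weights."""
    return sigma2 * np.sum(np.asarray(w) ** 2)

w_blue = np.full(n, 1.0 / n)                 # equal weights: the BLUE choice
w_alt = np.array([0.2] * 5 + [0.0] * 5)      # arbitrary unbiased alternative

# Both are unbiased (weights sum to one), but the equal-weight estimator
# attains the minimum variance sigma^2 / n.
assert np.isclose(w_blue.sum(), 1.0) and np.isclose(w_alt.sum(), 1.0)
print(var_linear(w_blue), var_linear(w_alt))
```

Any reweighting away from $w_t = 1/n$ that preserves unbiasedness strictly increases $\sum w_t^2$, which is exactly what the Lagrange-multiplier argument above establishes.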
4 Moment Matching (MM) principle
The moment matching principle was the result of a fundamental confusion between distribution and sample moments; Fisher (1922), p. 311.
Table 12.17: Parameters vs. estimators vs. estimates
Terms | Probability | Sample ($\mathbf{X}$) | Data ($\mathbf{x}_0$)
Mean | $\mu_1' = \int_{x \in \mathbb{R}_X} x f(x)\,dx$ | $\hat{\mu}_1'(\mathbf{X}) = \frac{1}{n}\sum_{i=1}^n X_i := \bar{X}$ | $\hat{\mu}_1'(\mathbf{x}_0) = \frac{1}{n}\sum_{i=1}^n x_i$
Variance | $\mu_2 = \int_{x \in \mathbb{R}_X} (x - \mu_1')^2 f(x)\,dx$ | $\hat{\mu}_2(\mathbf{X}) = \frac{1}{n}\sum_{i=1}^n (X_i - \bar{X})^2$ | $\hat{\mu}_2(\mathbf{x}_0) = \frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})^2$
Raw moments | $\mu_r' = \int_{x \in \mathbb{R}_X} x^r f(x)\,dx$ | $\hat{\mu}_r'(\mathbf{X}) = \frac{1}{n}\sum_{i=1}^n X_i^r$ | $\hat{\mu}_r'(\mathbf{x}_0) = \frac{1}{n}\sum_{i=1}^n x_i^r$, $r = 1, 2, \ldots$
Central moments | $\mu_r = \int_{x \in \mathbb{R}_X} (x - \mu_1')^r f(x)\,dx$ | $\hat{\mu}_r(\mathbf{X}) = \frac{1}{n}\sum_{i=1}^n (X_i - \bar{X})^r$ | $\hat{\mu}_r(\mathbf{x}_0) = \frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})^r$, $r = 2, 3, \ldots$
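The sample columns of table 12.17 are simple averages of powers of the data; a minimal sketch, where the function names are illustrative:

```python
import numpy as np

def raw_moment(x, r):
    """Sample raw moment: (1/n) * sum of x_i^r."""
    return float(np.mean(np.asarray(x, float) ** r))

def central_moment(x, r):
    """Sample central moment: (1/n) * sum of (x_i - xbar)^r."""
    x = np.asarray(x, float)
    return float(np.mean((x - x.mean()) ** r))

x0 = [1.0, 2.0, 3.0]
print(raw_moment(x0, 1))      # sample mean: 2.0
print(central_moment(x0, 2))  # sample variance (1/n version): 2/3
```

Applied to a data set $\mathbf{x}_0$ these return the estimates in the last column of the table; applied (conceptually) to the sample $\mathbf{X}$ they define the estimators in the middle column.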
Moment Matching (MM) principle: construct estimators by equating distribution moments with sample moments in two steps:
Step 1: Relate the unknown parameters to the moments of the distribution in terms of which the probability model is specified, say $\theta_i = g_i(\mu_1', \mu_2')$, $i = 1, 2$.
Step 2: Substitute the sample moments $\hat{\mu}_1' = \frac{1}{n}\sum_{i=1}^n X_i$, $\hat{\mu}_2' = \frac{1}{n}\sum_{i=1}^n X_i^2$ in place of the distribution moments to construct the moment estimators of $(\theta_1, \theta_2)$ via:
$\hat{\theta}_1 = g_1(\hat{\mu}_1', \hat{\mu}_2'), \quad \hat{\theta}_2 = g_2(\hat{\mu}_1', \hat{\mu}_2').$
Example 12.23. For the simple Bernoulli model (table 12.4): $E(X) = \theta$, and thus the MM principle suggests that a natural estimator for $\theta$ is $\hat{\theta} = \frac{1}{n}\sum_{i=1}^n X_i$.
Example 12.24. For the simple Normal model (table 12.6), the unknown parameters $\theta := (\mu, \sigma^2)$ are related to the mean and variance of $X$, respectively: $E(X) = \mu$, $Var(X) = \sigma^2$.
The MM principle proposes the obvious estimators of these parameters, i.e.
$\hat{\mu} = \frac{1}{n}\sum_{i=1}^n X_i, \quad \hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^n (X_i - \hat{\mu})^2.$
Example 12.25. For the Normal Linear Regression model (table 7.7), $\theta := (\beta_0, \beta_1, \sigma^2)$ are related to the moments of the bivariate distribution $f(x, y; \boldsymbol{\varphi})$ via:
$\beta_0 = E(Y) - \beta_1 E(X) \in \mathbb{R}, \quad \beta_1 = \frac{Cov(X, Y)}{Var(X)} \in \mathbb{R}, \quad \sigma^2 = Var(Y) - \frac{[Cov(X, Y)]^2}{Var(X)} \in \mathbb{R}_+.$
By substituting the corresponding sample moments in place of the distribution moments, we get the following MM principle estimators:
$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}, \quad \hat{\beta}_1 = \frac{\frac{1}{n}\sum_{t=1}^n (x_t - \bar{x})(y_t - \bar{y})}{\frac{1}{n}\sum_{t=1}^n (x_t - \bar{x})^2},$
$\hat{\sigma}^2_{MM} = \frac{1}{n}\sum_{t=1}^n (y_t - \bar{y})^2 - \frac{\left(\frac{1}{n}\sum_{t=1}^n (x_t - \bar{x})(y_t - \bar{y})\right)^2}{\frac{1}{n}\sum_{t=1}^n (x_t - \bar{x})^2}.$
Recall that in example 12.3 the first two moments of the sample mean $\hat{\theta} = \frac{1}{n}\sum_{i=1}^n X_i$, as an estimator of $E(X) = \theta$ in the case where $X_i \sim UIID(\theta - 0.5,\ \theta + 0.5)$, are:
$E(\hat{\theta}) = \theta, \quad Var(\hat{\theta}) = \frac{1}{12n},$
but the same moments for the midrange estimator $\hat{\theta}_{MM}(\mathbf{X}) = \frac{X_{[1]} + X_{[n]}}{2}$, where $X_{[1]}$ and $X_{[n]}$ denote the smallest and largest order statistics, are:
$E(\hat{\theta}_{MM}(\mathbf{X})) = \theta, \quad Var(\hat{\theta}_{MM}(\mathbf{X})) = \frac{1}{2(n+1)(n+2)}.$
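The two variances can be compared by simulation, confirming that under this uniform model the midrange is far more precise than the sample mean; a minimal sketch, where $\theta$, $n$ and the number of replications are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(2)
theta, n, reps = 0.0, 20, 200_000
x = rng.uniform(theta - 0.5, theta + 0.5, size=(reps, n))

mean_est = x.mean(axis=1)                            # sample mean
midrange_est = (x.min(axis=1) + x.max(axis=1)) / 2   # (X_[1] + X_[n]) / 2

# Monte Carlo variances vs the theoretical values 1/(12n) and 1/(2(n+1)(n+2))
print(mean_est.var(), 1 / (12 * n))
print(midrange_est.var(), 1 / (2 * (n + 1) * (n + 2)))
```

For $n = 20$ the theoretical variances are $1/240 \approx 0.0042$ versus $1/924 \approx 0.0011$, so the midrange estimator's variance also shrinks at rate $n^{-2}$ rather than $n^{-1}$.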