Journal of Statistical Software
March 2014, Volume 57, Issue 1. http://www.jstatsoft.org/

lavaan.survey: An R Package for Complex Survey Analysis of Structural Equation Models

Daniel Oberski
Tilburg University

Abstract

This paper introduces the R package lavaan.survey, a user-friendly interface to design-based complex survey analysis of structural equation models (SEMs). By leveraging existing code in the lavaan and survey packages, the lavaan.survey package allows for SEM analyses of stratified, clustered, and weighted data, as well as multiply imputed complex survey data. lavaan.survey provides several features such as SEMs with replicate weights, a variety of resampling techniques for complex samples, and finite population corrections, features that should prove useful for SEM practitioners faced with the common situation of a sample that is not iid.

Keywords: complex survey analysis, structural equation modeling, clustering, stratification, sampling weights, multiple imputation, resampling, jackknife, bootstrap, replicate weights, R.

1. Introduction

Structural equation models (SEMs) constitute a popular framework for formulating, fitting, and testing an abundant variety of models for continuous interval-level data in a wide range of fields. Special cases of structural equation modeling include factor analysis, (multivariate) linear regression, path analysis, random growth curve and other longitudinal models, errors-in-variables models, and mediation analysis (Bollen 1989; Kline 2011). The main development of structural equation modeling has been in social science fields such as psychology (Ullman and Bentler 2003), education (Kaplan 2008), and sociology (Duncan 1975; Saris and Stronkhorst 1984), while more recently structural equation modeling is finding applications in other fields such as ecology and biology (Grace 2006) and neuroscience (McIntosh and Gonzalez-Lima 1994; Roelstraete and Rosseel 2011).
While classical SEM theory assumes independently and identically distributed (iid) observations (Bollen 1989), applications often require the analysis of data from complex surveys that may involve stratification, clustering, and unequal selection probabilities, violating this assumption (Skinner, Holt, and Smith 1989; Muthén and Satorra 1995, p. 281). For example, Marsh and Hau (2004) explained the relations between academic self-concepts and achievements in a 26-country complex multistage survey. Clustering may also occur outside the realm of complex surveys, for instance in the analysis by Byrnes et al. (2011) of the effect of storms on kelp forest food webs, where variables such as kelp density and species richness are likely correlated across sites that are geographically close to each other.

It is well known that under complex sampling, both point and variance estimators derived under iid assumptions may produce biased and inconsistent estimates (Cochran 1977; Skinner et al. 1989). This finding was reproduced for SEM parameter estimates by Kaplan and Ferguson (1999) and Asparouhov and Muthén (2005). Hahs-Vaughn and Lomax (2006) analyzed student data from the Beginning Postsecondary Students Longitudinal study to explain college experiences and learning outcomes with pre-college traits, showing that SEM parameter estimates, standard errors, and fit measures can change dramatically when complex sampling is taken into account.

Adjustments to point and variance estimators for SEMs under complex sampling were discussed by Muthén and Satorra (1995) and Stapleton (2006), and estimation using pseudo-maximum likelihood procedures by Asparouhov (2005, 2006) and Asparouhov and Muthén (2005). For an overview of literature related to complex sampling in structural equation modeling, see Bollen, Tueller, and Oberski (2013).
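The point that iid-based estimators can mislead under complex sampling can be illustrated with a small sketch. The Python code below is not part of lavaan.survey or the survey package; the function names and the toy data (four clusters of three observations, with two clusters carrying larger sampling weights) are hypothetical and chosen only for illustration. It compares the unweighted mean and its iid standard error with a weighted mean and a cluster-level linearization standard error, assuming a single stratum and with-replacement sampling of clusters.

```python
import math
from collections import defaultdict


def naive_mean_se(y):
    """Unweighted mean and its standard error under an iid assumption."""
    n = len(y)
    mean = sum(y) / n
    var = sum((v - mean) ** 2 for v in y) / (n - 1)
    return mean, math.sqrt(var / n)


def design_mean_se(y, w, cluster):
    """Weighted mean with a cluster-level linearization standard error.

    Assumes one stratum and clusters sampled with replacement
    (an illustrative simplification, not a general survey estimator).
    """
    total_w = sum(w)
    mean = sum(wi * yi for wi, yi in zip(w, y)) / total_w
    # Sum the linearized contributions w_i * (y_i - mean) / sum(w) per cluster.
    cluster_totals = defaultdict(float)
    for yi, wi, c in zip(y, w, cluster):
        cluster_totals[c] += wi * (yi - mean) / total_w
    m = len(cluster_totals)
    # The cluster totals sum to zero, so no centering term is needed.
    var = m / (m - 1) * sum(t ** 2 for t in cluster_totals.values())
    return mean, math.sqrt(var)


# Hypothetical toy data: values are identical within each cluster.
y = [1, 1, 1, 5, 5, 5, 9, 9, 9, 13, 13, 13]
cluster = [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3]
w = [1, 1, 1, 1, 1, 1, 3, 3, 3, 3, 3, 3]

print("iid:   ", naive_mean_se(y))
print("design:", design_mean_se(y, w, cluster))
```

In this toy example the weights shift the point estimate, and the strong within-cluster similarity makes the cluster-based standard error larger than the iid one, which is the pattern the studies cited above report for SEM estimates.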
These adjustment and estimation procedures have since been implemented in standard closed-source commercial software for SEMs: LISREL (Jöreskog and Sörbom 2006), Mplus (Muthén and Muthén 2012), EQS (Bentler 2008), and Stata (StataCorp. 2011a,b). Another popular commercial program, AMOS (Arbuckle 2011), does not implement complex sampling estimation at the time of writing.

None of the open-source SEM packages, sem (Fox 2006; Fox, Nie, and Byrnes 2012), OpenMx (Boker et al. 2011), and lavaan (Rosseel 2012), directly implement complex survey adjustments. These packages do provide enough flexibility to allow for such adjustments through resampling methods if the user is willing to program these (the sem manual provides some guidance to this effect). More user-friendly interfaces are currently not available. Furthermore, with the exception of Stata and Mplus, the commercial packages that do implement estimation procedures for complex sampling still omit features dealing with several complications that may arise in the analysis of complex surveys:

- Some secondary data sources such as the OECD's Programme for International Student Assessment (PISA) do not provide the sampling design variables directly, but instead provide a set of so-called "replicate weights" (OECD 2009). In principle this represents a considerable simplification of highly complex survey analysis (Brick, Morganstein, and Valliant 2000). Currently, however, not all SEM software allows for adjustments of SEM estimators using replicate weights;
- More generally, variance estimation of SEM parameters with complex sampling using resampling methods such as the jackknife and bootstrap is not implemented directly but requires additional programming on the part of the user (see Stapleton 2008, for a discussion of these methods in the context of SEMs);
- Structural equation modeling is primarily an analytic method, so that finite population corrections may not usually be relevant (e.g., Fuller 2009, p. 342). However, structural equation modeling is also a flexible method of reformulating several descriptive methods for which the finite population may be of interest, such as domain mean and model-based small area estimation. Currently finite population corrections, which may be relevant for these purposes, are not available in all SEM programs.

The purpose of this article is to introduce the lavaan.survey package (Oberski 2013a) for the R environment (R Core Team 2013), which serves to bring user-friendly complex survey SEM analysis to the open source SEM implementation lavaan. In addition, by leveraging the many features of the survey package (Lumley 2004, 2010, 2012b) it provides users with the above features currently omitted from some commercially available SEM software packages. Thanks to code reuse and the flexibility of the survey and lavaan packages, the lavaan.survey package is able to provide an extremely flexible, user-friendly, and open source framework for design-based analysis of complex survey data using SEM. It also allows for the analysis of multiply imputed complex survey data (Little and Rubin 1987; Graham and Hofer 2000). At the time of writing, a limitation of the package is that it deals with the continuous case only.

The package is available from the Comprehensive R Archive Network at http://CRAN.R-project.org/package=lavaan.survey.

Section 2 discusses the theory of structural equation modeling in general and SEM under complex sampling in particular. After a brief overview of the package in Section 3, Sections 4.1, 4.2, 4.3, and 4.4 demonstrate the usage of the package by applying it to SEM analyses arising from the literature.

2. Technical explanation

Different methods have been suggested to deal with complex sampling in SEMs. In this article we will only deal with "aggregate" design-based methods (see Skinner et al. 1989, p. 8; Muthén and Satorra 1995). "Design-based" refers to the fact that inferences are based on the theoretical distribution of all possible samples under a particular survey design. Such a basis for inference stands in contrast to the "model-based" approach, which derives point and variance estimators from the assumed model. In practice, the two may sometimes coincide (see Sterba 2009, for an overview).

Three aggregate design-based point estimators have been suggested in the literature: adjustment of the weights or sample size to an effective sample size (Stapleton 2002), pseudo-maximum likelihood (Muthén and Satorra 1995; Asparouhov 2005, 2006), and weighted least squares estimation (Skinner et al. 1989, p. 86; Vieira and Skinner 2008); see Stapleton (2006) for an overview of these approaches. For these point estimators, different variance estimation methods are possible, including linearization (Skinner et al. 1989, p. 83; Muthén and Satorra 1995, p. 279) and a range of resampling methods (Stapleton 2008).

This article and the lavaan.survey package adopt a framework due to Muthén and Satorra (1995) that encompasses pseudo-maximum likelihood (PML) or weighted ("generalized") least squares (WLS) point estimation, and variance estimation by linearization or resampling. The choice of which combination of methods to employ is left to the user, the default being PML, the de facto standard for SEMs at the time of writing (Asparouhov 2005).

The framework adopted here starts from the observation (Skinner et al. 1989, p. 78) that the problem of the estimation of SEM parameters under complex sampling can be simplified to the usual problem of estimation of means under complex sampling through a classical three-step device (e.g., Fuller 1987, Appendix 4.B). The current discussion of this remarkable observation is necessarily more condensed than that found in the comprehensive discussion by