JSS
Journal of Statistical Software
March 2014, Volume 57, Issue 1.
http://www.jstatsoft.org/
lavaan.survey: An R Package for Complex Survey Analysis of Structural Equation Models
Daniel Oberski
Tilburg University
Abstract
This paper introduces the R package lavaan.survey, a user-friendly interface to design-based complex survey analysis of structural equation models (SEMs). By leveraging existing code in the lavaan and survey packages, the lavaan.survey package allows for SEM analyses of stratified, clustered, and weighted data, as well as multiply imputed complex survey data. lavaan.survey provides several features such as SEMs with replicate weights, a variety of resampling techniques for complex samples, and finite population corrections, features that should prove useful for SEM practitioners faced with the common situation of a sample that is not iid.
Keywords: complex survey analysis, structural equation modeling, clustering, stratification, sampling weights, multiple imputation, resampling, jackknife, bootstrap, replicate weights, R.
1. Introduction
Structural equation models (SEMs) constitute a popular framework for formulating, fitting, and testing an abundant variety of models for continuous interval-level data in a wide range of fields. Special cases of structural equation modeling include factor analysis, (multivariate) linear regression, path analysis, random growth curve and other longitudinal models, errors-in-variables models, and mediation analysis (Bollen 1989; Kline 2011). The main development of structural equation modeling has been in social science fields such as psychology (Ullman and Bentler 2003), education (Kaplan 2008), and sociology (Duncan 1975; Saris and Stronkhorst 1984), while more recently structural equation modeling is finding applications in other fields such as ecology and biology (Grace 2006) and neuroscience (McIntosh and Gonzalez-Lima 1994; Roelstraete and Rosseel 2011).
While classical SEM theory assumes independently and identically distributed (iid) observations (Bollen 1989), applications often require the analysis of data from complex surveys
that may involve stratification, clustering, and unequal selection probabilities, violating this assumption (Skinner, Holt, and Smith 1989; Muthén and Satorra 1995, p. 281). For example, Marsh and Hau (2004) explained the relations between academic self-concepts and achievements in a 26-country complex multistage survey. Clustering may also occur outside the realm of complex surveys, for instance in the analysis by Byrnes et al. (2011) of the effect of storms on kelp forest food webs, where variables such as kelp density and species richness are likely correlated across sites that are geographically close to each other. It is well known that under complex sampling, both point and variance estimators derived under iid assumptions may produce biased and inconsistent estimates (Cochran 1977; Skinner et al. 1989). This finding was reproduced for SEM parameter estimates by Kaplan and Ferguson (1999) and Asparouhov and Muthén (2005). Hahs-Vaughn and Lomax (2006) analyzed student data from the Beginning Postsecondary Students Longitudinal Study to explain college experiences and learning outcomes with pre-college traits, showing that SEM parameter estimates, standard errors, and fit measures can change dramatically when complex sampling is taken into account.
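To make the point about biased point estimates concrete, the following toy sketch in base R compares an unweighted mean with a design-weighted mean when one stratum is oversampled. The data, the stratum labels, and the weights are invented for illustration and do not come from any of the studies cited above.

```r
# Hypothetical toy sample: stratum "a" makes up 20% of the population but was
# oversampled, so that it accounts for half of the sample.
y       <- c(10, 12, 11, 13, 2, 4, 3, 5)
stratum <- rep(c("a", "b"), each = 4)

# Sampling weight = inverse selection probability (assumed known here):
# each "a" unit stands for 5 population units, each "b" unit for 20.
w <- ifelse(stratum == "a", 5, 20)

naive_mean    <- mean(y)              # treats the sample as iid
weighted_mean <- sum(w * y) / sum(w)  # design-weighted (ratio) estimator

print(c(naive = naive_mean, weighted = weighted_mean))
```

The unweighted mean is pulled toward the oversampled stratum, whereas the weighted mean reproduces the population mix implied by the weights. The same logic carries over to the means, variances, and covariances on which an SEM is fitted.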
Adjustments to point and variance estimators for SEMs under complex sampling were discussed by Muthén and Satorra (1995) and Stapleton (2006), and estimation using pseudo-maximum likelihood procedures by Asparouhov (2005, 2006) and Asparouhov and Muthén (2005). For an overview of literature related to complex sampling in structural equation modeling, see Bollen, Tueller, and Oberski (2013). These procedures have since been implemented in standard closed-source commercial software for SEMs: LISREL (Jöreskog and Sörbom 2006), Mplus (Muthén and Muthén 2012), EQS (Bentler 2008), and Stata (StataCorp. 2011a,b). Another popular commercial program, AMOS (Arbuckle 2011), does not implement complex sampling estimation at the date of writing.
None of the open-source SEM packages, sem (Fox 2006; Fox, Nie, and Byrnes 2012), OpenMx (Boker et al. 2011), and lavaan (Rosseel 2012), directly implement complex survey adjustments. These packages do provide enough flexibility to allow for such adjustments through resampling methods if the user is willing to program these (the sem manual provides some guidance to this effect). More user-friendly interfaces are currently not available. Furthermore, with the exception of Stata and Mplus, the commercial packages that do implement estimation procedures for complex sampling still omit features dealing with several complications that may arise in the analysis of complex surveys:
- Some secondary data sources such as the OECD's Programme for International Student Assessment (PISA) do not provide the sampling design variables directly, but instead provide a set of so-called "replicate weights" (OECD 2009). In principle this represents a considerable simplification of highly complex survey analysis (Brick, Morganstein, and Valliant 2000). Currently, however, not all SEM software allows for adjustments of SEM estimators using replicate weights;
- More generally, variance estimation of SEM parameters with complex sampling using resampling methods such as the jackknife and bootstrap is not implemented directly but requires additional programming on the part of the user (see Stapleton 2008, for a discussion of these methods in the context of SEMs);
- Structural equation modeling is primarily an analytic method, so that finite population corrections may not usually be relevant (e.g., Fuller 2009, p. 342). However, structural equation modeling is also a flexible method of reformulating several descriptive methods
  for which the finite population may be of interest, such as domain mean and model-based small area estimation. Currently finite population corrections, which may be relevant for these purposes, are not available in all SEM programs.
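As an illustration of what replicate weights do, the sketch below builds delete-one-cluster jackknife (JK1) replicate weights by hand in base R and uses them to estimate the variance of a weighted mean. It is a toy example with invented data and equal full-sample weights. Replicate-weight files distributed with real secondary data may follow a different scheme (for example balanced repeated replication), and an SEM analysis would re-estimate all model parameters for each replicate rather than a single mean.

```r
# Hypothetical toy data: 4 clusters (primary sampling units) of 3 observations.
y       <- c(4, 5, 6, 7, 8, 9, 1, 2, 3, 6, 6, 6)
cluster <- rep(1:4, each = 3)
w       <- rep(10, length(y))   # full-sample weights (assumed equal here)
n_c     <- length(unique(cluster))

# JK1 replicate weights: column r sets the weights of cluster r to zero and
# rescales the weights of the remaining clusters by n_c / (n_c - 1).
rep_w <- sapply(seq_len(n_c), function(r) {
  ifelse(cluster == r, 0, w * n_c / (n_c - 1))
})

wmean <- function(wt) sum(wt * y) / sum(wt)

theta_full <- wmean(w)               # estimate from the full-sample weights
theta_rep  <- apply(rep_w, 2, wmean) # one estimate per replicate

# JK1 variance: scaled sum of squared deviations of the replicate estimates
# from the full-sample estimate.
v_jk  <- (n_c - 1) / n_c * sum((theta_rep - theta_full)^2)
se_jk <- sqrt(v_jk)

# Naive standard error that ignores the clustering, for comparison.
se_iid <- sd(y) / sqrt(length(y))

print(c(estimate = theta_full, se_jackknife = se_jk, se_iid = se_iid))
```

Because the observations within a cluster resemble each other in this toy data set, the jackknife standard error is larger than the naive one. The analyst only needs the columns of `rep_w`, not the design variables themselves, which is why replicate weights simplify the analysis of highly complex designs.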
The purpose of this article is to introduce the lavaan.survey package (Oberski 2013a) for the R environment (R Core Team 2013), which serves to bring user-friendly complex survey SEM analysis to the open source SEM implementation lavaan. In addition, by leveraging the many features of the survey package (Lumley 2004, 2010, 2012b) it provides users with the above features currently omitted from some commercially available SEM software packages.
Thanks to code reuse and the flexibility of the survey and lavaan packages, the lavaan.survey package is able to provide an extremely flexible, user-friendly, and open source framework for design-based analysis of complex survey data using SEM. It also allows for the analysis of multiply imputed complex survey data (Little and Rubin 1987; Graham and Hofer 2000). At the time of writing, a limitation of the package is that it deals with the continuous case only. The package is available from the Comprehensive R Archive Network at http://CRAN.R-project.org/package=lavaan.survey.
Section 2 discusses the theory of structural equation modeling in general and SEM under complex sampling in particular. After a brief overview of the package in Section 3, Sections 4.1, 4.2, 4.3, and 4.4 demonstrate the usage of the package by applying it to SEM analyses arising from the literature.
2. Technical explanation
Different methods have been suggested to deal with complex sampling in SEMs. In this article we will only deal with "aggregate" design-based methods (see Skinner et al. 1989, p. 8; Muthén and Satorra 1995). "Design-based" refers to the fact that inferences are based on the theoretical distribution of all possible samples under a particular survey design. Such a basis for inference stands in contrast to the "model-based" approach, which derives point and variance estimators from the assumed model. In practice, the two may sometimes coincide (see Sterba 2009, for an overview). Three aggregate design-based point estimators have been suggested in the literature: adjustment of the weights or sample size to an effective sample size (Stapleton 2002), pseudo-maximum likelihood (Muthén and Satorra 1995; Asparouhov 2005, 2006), and weighted least squares estimation (Skinner et al. 1989, p. 86; Vieira and Skinner 2008); see Stapleton (2006) for an overview of these approaches. For these point estimators, different variance estimation methods are possible, including linearization (Skinner et al. 1989, p. 83; Muthén and Satorra 1995, p. 279) and a range of resampling methods (Stapleton 2008).
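To give a feel for linearization, the sketch below applies the usual with-replacement "ultimate cluster" linearization estimator to the simplest possible statistic, a weighted mean, in base R. The data and weights are invented, and the estimator shown is the textbook form for a ratio mean. It is meant only to illustrate the idea and is not the expression used for SEM parameters in the references above.

```r
# Hypothetical toy data: 4 clusters of 3 observations with unequal weights.
y       <- c(4, 5, 6, 7, 8, 9, 1, 2, 3, 6, 6, 6)
cluster <- rep(1:4, each = 3)
w       <- rep(c(10, 20, 10, 20), each = 3)

# Point estimate: weighted mean.
ybar_w <- sum(w * y) / sum(w)

# Linearized values: first-order contribution of each observation to the
# estimation error of the weighted mean. They sum to zero by construction.
z <- w * (y - ybar_w) / sum(w)

# Observations in the same cluster are not independent, so the variance is
# computed from cluster totals of the linearized values.
z_c <- tapply(z, cluster, sum)
n_c <- length(z_c)

v_lin  <- n_c / (n_c - 1) * sum((z_c - mean(z_c))^2)
se_lin <- sqrt(v_lin)

print(c(estimate = ybar_w, se_linearization = se_lin))
```

Resampling methods such as the jackknife arrive at a comparable variance estimate by recomputing the statistic on perturbed weights instead of deriving the linearized values analytically.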
This article and the lavaan.survey package adopt a framework due to Muthén and Satorra (1995) that encompasses pseudo-maximum likelihood (PML) or weighted ("generalized") least squares (WLS) point estimation, and variance estimation by linearization or resampling. The option of which combination of methods to employ is left to the user, the default being PML, the de facto standard for SEMs at the time of writing (Asparouhov 2005).
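Design-based point estimation of this kind works from design-weighted sample moments. Each element of a covariance matrix is a mean of cross-products, so sampling weights can enter through weighted means. The base R sketch below computes a weighted covariance matrix in this way and checks it against `stats::cov.wt()`. The simulated data and the weights are hypothetical, and the normalization shown (weights scaled to sum to one, no small-sample correction) is an assumption of this sketch rather than a statement about the internals of lavaan.survey.

```r
set.seed(1)
n <- 50

# Hypothetical data: three continuous variables and unequal sampling weights.
X <- cbind(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
w <- runif(n, min = 1, max = 5)

# Normalize the weights so that weighted sums become weighted means.
p <- w / sum(w)

# Weighted means of the observed variables.
mu_w <- colSums(p * X)

# Weighted covariance matrix: weighted mean of the centered cross-products.
Xc  <- sweep(X, 2, mu_w)
S_w <- crossprod(sqrt(p) * Xc)

# Cross-check against the built-in weighted moment estimator.
ref <- cov.wt(X, wt = p, method = "ML")

print(round(S_w, 3))
```

A model-implied covariance matrix would then be fitted to `S_w` instead of the unweighted sample covariance matrix, while the sampling variability of `S_w` itself is obtained by linearization or resampling.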
The framework adopted here starts from the observation (Skinner et al. 1989, p. 78) that the problem of the estimation of SEM parameters under complex sampling can be simplified to the usual problem of estimation of means under complex sampling through a classical three-step device (e.g., Fuller 1987, Appendix 4.B). The current discussion of this remarkable observation is necessarily more condensed than that found in the comprehensive discussion by