How to Handle Correlated Data

In the simplest terms, a linear model describes a continuous response variable as a function of predictor variables. For example, a student’s exam result can be affected by several different factors, whose coefficients are known as fixed effects β. These factors could include the amount of study time, how frequently the student attends lectures and tutorials, the time spent in consultation with the lecturer, the number of hours of sleep, or anything else that could affect performance on exam day.

However, the linear model would be more accurate if we also considered the inherent ability of each student (since each person is born with different capabilities), known as the random effect. This inherent ability can be incorporated into the model by assuming that it comes from a normal distribution with mean 0 and some variance. The result is called a linear mixed model (LMM) since it includes both fixed and random effects.
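
To see why a random effect produces correlated data, here is a minimal simulation sketch. It assumes, purely for illustration, that each student sits two exams and that the only fixed effect is study hours; all the numbers (beta0, beta1, sigma_b, sigma_e) are hypothetical and not taken from any real data set. Because both scores of a student share the same ability term, their residuals are correlated, with correlation sigma_b^2 / (sigma_b^2 + sigma_e^2).

```python
import numpy as np

rng = np.random.default_rng(0)

n_students, n_exams = 2000, 2      # hypothetical design: two exams per student
beta0, beta1 = 50.0, 2.0           # hypothetical fixed effects (intercept, study hours)
sigma_b, sigma_e = 8.0, 6.0        # hypothetical SDs of ability and of exam-level noise

study_hours = rng.uniform(0, 10, size=(n_students, n_exams))
ability = rng.normal(0.0, sigma_b, size=(n_students, 1))    # random effect, one per student
noise = rng.normal(0.0, sigma_e, size=(n_students, n_exams))

# Linear mixed model: fixed part + student-specific random intercept + noise
score = beta0 + beta1 * study_hours + ability + noise

# Subtract the (known, simulated) fixed part and correlate the two exams
resid = score - (beta0 + beta1 * study_hours)
empirical_corr = np.corrcoef(resid[:, 0], resid[:, 1])[0, 1]
theoretical_corr = sigma_b**2 / (sigma_b**2 + sigma_e**2)

print(f"empirical within-student correlation:   {empirical_corr:.3f}")
print(f"theoretical within-student correlation: {theoretical_corr:.3f}")
```

Setting sigma_b to 0 in this sketch removes the random effect, and the two exam residuals become uncorrelated, which is the situation an ordinary linear model assumes.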

One might also wonder how to analyse a model in which the response variable is binary or a count, for example a yes/no response or the number of salamanders at different sites. Here, rather than modelling the mean of the response directly, as we would for a continuous response, the common approach is to relate the mean to the predictors through a link function. This is the idea behind the generalized linear mixed model (GLMM). Now, suppose we continue the salamander example above with a marginal model for clustered data (a cluster refers to a site). When we look at the standard errors of the marginal model, there are three things that need to be assumed correctly:

Conditional mean model – given the factors affecting the result, the mean is modelled through a link function (which in this case would be the log link).
Conditional variance – given the factors affecting the result, the variance is given by some variance function of the mean multiplied by a constant.
Association – the relationship between two observations within the same cluster (site).
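
In symbols, one common way to write these three components for observation j at site i is sketched below. The notation (Y_ij for the count, x_ij for the predictors, μ_ij for the mean, φ for the constant, v for the variance function and α for the association parameters) is assumed here for illustration.

```latex
\begin{align*}
\log \mu_{ij} &= x_{ij}^\top \beta, \qquad \mu_{ij} = E(Y_{ij} \mid x_{ij}) \\
\operatorname{Var}(Y_{ij} \mid x_{ij}) &= \phi \, v(\mu_{ij}) \\
\operatorname{Corr}(Y_{ij}, Y_{ik} \mid x_{ij}, x_{ik}) &= \rho_{jk}(\alpha), \qquad j \neq k
\end{align*}
```

For Poisson-type counts a natural choice is v(μ) = μ, and an exchangeable association would set ρ_jk(α) = α for every pair of observations within a site.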

Thus, after setting up the three components above, we introduce the idea of Liang & Zeger (1986): using Generalized Estimating Equations (GEE) to estimate β and its variance or standard error. In general, they introduce a sandwich estimator that gives robust standard errors even when the model is not fully correctly specified, in particular when the assumed association structure is wrong. (It is called a sandwich because it has three components: the first and third parts of the formula are the “bread”, and the middle part is the “meat”.) However, this robustness only holds for large sample sizes, so with a small sample it is important to specify a good working model in order to get accurate variances and standard errors. Note that GEE only considers models with fixed effects. For a model with both fixed and random effects, the same sandwich idea applies, but with slightly different calculations and matrices in each application.
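
The sandwich idea can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the estimator as implemented in any particular package: it uses a log link, the Poisson variance function v(μ) = μ with the constant fixed at 1, and an independence working association. The function name poisson_gee_independence, the covariate cover and all simulated values are hypothetical. Under these assumptions the bread is X'WX with W = diag(μ), and the meat sums the outer products of the per-site score contributions.

```python
import numpy as np

def poisson_gee_independence(X, y, cluster, n_iter=50, tol=1e-10):
    """Log-link Poisson mean model with an independence working association.

    Returns (beta, naive_cov, robust_cov). Assumes column 0 of X is an intercept.
    """
    beta = np.zeros(X.shape[1])
    beta[0] = np.log(y.mean())              # start at the overall mean count
    for _ in range(n_iter):                 # Newton steps on the estimating equations
        mu = np.exp(X @ beta)
        score = X.T @ (y - mu)
        bread = X.T @ (X * mu[:, None])
        step = np.linalg.solve(bread, score)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break

    mu = np.exp(X @ beta)
    bread_inv = np.linalg.inv(X.T @ (X * mu[:, None]))   # the "bread"
    meat = np.zeros((X.shape[1], X.shape[1]))            # the "meat"
    for c in np.unique(cluster):
        rows = cluster == c
        s = X[rows].T @ (y[rows] - mu[rows])             # score contribution of one site
        meat += np.outer(s, s)
    return beta, bread_inv, bread_inv @ meat @ bread_inv

# Hypothetical data: 200 sites, 5 counts per site, one site-level covariate
rng = np.random.default_rng(1)
n_sites, n_visits = 200, 5
site = np.repeat(np.arange(n_sites), n_visits)
cover = np.repeat(rng.normal(size=n_sites), n_visits)
site_effect = np.repeat(rng.normal(0.0, 0.5, size=n_sites), n_visits)  # induces correlation
y = rng.poisson(np.exp(0.5 + 0.3 * cover + site_effect))
X = np.column_stack([np.ones(y.size), cover])

beta, naive_cov, robust_cov = poisson_gee_independence(X, y, site)
naive_se = np.sqrt(np.diag(naive_cov))
robust_se = np.sqrt(np.diag(robust_cov))
print("beta:     ", beta)
print("naive SE: ", naive_se)
print("robust SE:", robust_se)
```

Because counts from the same site share a site effect that the independence working model ignores, the naive standard errors in this sketch tend to be too small, while the sandwich standard errors account for the within-site correlation.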

Reference:

Liang K-Y, Zeger SL. Longitudinal data analysis using generalized linear models. Biometrika, Vol. 73, No. 1 (Apr., 1986), pp. 13-22.

Wilson Lorensyah
The University of Queensland