The S+ CorrelatedData Library (May 19, 2005)


Jose Pinheiro gave a web seminar on the S+ CorrelatedData Library. An archive of this presentation is at

I have reported on other S+ web seminars in the past.

The CorrelatedData library extends the Generalized Linear Model (GLM) to single-level and multi-level grouped-data problems.

The classic linear regression model assumes that the outcome variable is a linear function of the predictor variables plus an error term with constant variance. The GLM extends the linear model in two ways. First, it allows for a link function, so that the model can be linear on a different scale (such as a log scale). Second, it allows for a variance function, so that the error term has a non-constant variance that changes as the mean changes. This allows, for example, for better modeling of count data, because groups with higher average counts also tend to show greater variability.

In S-plus, you use the glm() function to fit a GLM. There are three important arguments in glm():

  1. formula is a linear formula relating the predictor variables to the outcome;
  2. family specifies a combination of link function and variance function that works well for a particular distribution. For example, the Poisson family uses a log link and a variance function equal to the mean;
  3. data specifies the data frame that includes the predictor variables and the outcome.
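Putting the three arguments together, a Poisson regression call might look like the sketch below. The data frame counts.df and its variables y and trt are hypothetical, not from the seminar:

```s
# A minimal sketch, assuming a hypothetical data frame counts.df with a
# count outcome y and a predictor trt
fit <- glm(y ~ trt, family = poisson, data = counts.df)
summary(fit)
```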

Dr. Pinheiro showed two examples of GLM models where the data did not seem to fit very well. In both models, the dispersion parameter, estimated as the residual deviance divided by the degrees of freedom, was much larger than 1. This implies that there was substantial center-to-center variation in the first study (a multi-center trial) and substantial patient-to-patient variation in the second study (a longitudinal trial). The residual plots were also troublesome, because some centers/patients had entirely negative residuals and others had entirely positive residuals.
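This dispersion check can be computed directly from the components of a fitted glm object; the sketch below assumes fit was produced by an earlier glm() call:

```s
# Estimate the dispersion parameter as residual deviance divided by
# residual degrees of freedom; values much larger than 1 suggest
# unmodeled group-to-group variation
fit$deviance / fit$df.residual
```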

The Generalized Linear Mixed Model (GLMM) is an extension of the GLM that allows for better modeling of center-to-center or patient-to-patient variation.

GLMM can be thought of as a compromise between two simpler approaches: ignoring the group-to-group variation entirely, or including a separate fixed effect for each patient or center.

The latter approach can be very inefficient, since you lose a degree of freedom for each patient or center in your analysis. The GLMM model can be thought of as having each individual or center borrow strength from the other patients or centers in the model.

GLMM is also an extension of the linear mixed effects (LME) model. It allows for mixed effects like LME, but provides the flexibility of a link function and variance function.

The algorithms for the GLMM model are complex, because a full maximum likelihood approach would take too much time, even with today's superfast computers. Alternative approaches include PQL (Penalized Quasi-Likelihood) and MQL (Marginal Quasi-Likelihood). There are restricted versions of these algorithms, REPQL and REMQL. For specific families (the binomial and Poisson families), there are additional fitting algorithms: the Laplacian approximation and adaptive Gaussian quadrature. These approaches may avoid some of the problems of PQL and MQL, which have been shown in certain circumstances to produce biased estimates.

To fit a GLMM in S-plus, you use the glme() function. This function has fixed, data, and family arguments analogous to those of the glm() function, plus a new argument, random, which specifies the random effects. Typically the random effects would capture variation due to centers in a multi-center trial or variation due to patients in a longitudinal study.
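A multi-center trial might be fit along these lines. The library name, the data frame trial.df, and the nlme-style random formula are assumptions on my part, not details from the seminar:

```s
# Sketch of a GLMM fit with a random intercept for each center
library(correlatedData)                    # library name is an assumption
fit.glme <- glme(fixed = y ~ trt,          # fixed effects, like glm's formula
                 random = ~ 1 | center,    # random intercept per center (assumed syntax)
                 family = binomial,
                 data = trial.df)          # hypothetical data frame
summary(fit.glme)
```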

The glme() function produces a glme object with information that you can print or plot. You can, for example, easily produce residual plots from the glme object or test the normality of the random effects.

You can generalize a GLMM to allow for multiple random effects, such as effects due to different countries, different centers within each country, and different patients within each center. This is an example of nested random effects. The PQL and MQL algorithms work very efficiently with nested random effects, taking advantage of the special structure that the nesting produces. You can also use random effects that are crossed with each other rather than nested, but the algorithms here are not as efficient.
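In nlme-style notation, nesting centers within countries might be written as below; the exact syntax, along with the hypothetical data frame trial.df, is an assumption:

```s
# Sketch of nested random effects: a random intercept for each country
# and for each center within a country (assumed nlme-style syntax)
fit.nested <- glme(fixed = y ~ trt,
                   random = ~ 1 | country/center,
                   family = binomial,
                   data = trial.df)
```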

I asked if Dr. Pinheiro could explain the difference between Generalized Estimating Equations (GEE), which use a marginal approach, and GLMM, which uses a random effects approach. He said that the GEE approach models the correlation structure of the data directly, while GLMM produces a correlation structure implicitly through the random effects.