Ratio of observations to independent variables (2004-11-17)

A widely quoted rule is that you need 10 or 15 observations per independent variable in a regression model. The original source of this rule of thumb is difficult to find. I briefly commented on this in an earlier weblog entry, but here is a more complete elaboration.
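As a quick illustration of the arithmetic behind this rule of thumb, here is a minimal Python sketch. The function name max_predictors and the sample size of 150 are hypothetical, chosen only for this example; the rule itself is the 10 or 15 observations per variable quoted above.

```python
def max_predictors(n_observations: int, per_variable: int = 10) -> int:
    """Largest number of independent variables the rule of thumb allows."""
    if per_variable <= 0:
        raise ValueError("per_variable must be positive")
    # Integer division: a partial "slot" does not justify another variable.
    return n_observations // per_variable


n = 150  # hypothetical sample size, not taken from any real study
lenient = max_predictors(n, per_variable=10)
strict = max_predictors(n, per_variable=15)
print(f"{n} observations: at most {lenient} variables at 10 per variable, "
      f"{strict} at 15 per variable")
```

The gap between the two answers is a reminder that this is a rough guide rather than a precise threshold.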

When you are trying to build a regression model using a stepwise variable selection process (or something similar to stepwise selection), there is substantial reason for caution. Stepwise selection tends to lead to poor choices for the regression model that do not replicate well. I abstracted some arguments against stepwise variable selection as part of the STAT-L FAQ.

Frank Harrell did some empirical investigation of stepwise variable selection in the logistic regression model and the Cox proportional hazards regression model. For these models, it is not the number of observations that matters, but the number of events. Suppose you study thousands of patients, split evenly between two groups, and find that four die in the control group but only two die in the treatment group. That is a halving of the mortality rate, yet no one would trust such a result. Your effective sample size is those six deaths, not the thousands of patients studied.
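To see why six deaths carry so little information, here is a rough sketch that runs a two-sided Fisher exact test on the four-versus-two example. The group size of 2000 patients per arm is an assumption made for illustration (the text says only "thousands"), and the function name fisher_exact_two_sided is hypothetical. Fisher's test is simply one convenient way to check the example; it is not the method Harrell used.

```python
from fractions import Fraction
from math import comb


def fisher_exact_two_sided(a: int, b: int, n1: int, n2: int) -> float:
    """Two-sided Fisher exact p-value for a events among n1 and b among n2."""
    events = a + b
    total = n1 + n2
    denom = comb(total, n1)

    def prob(k: int) -> Fraction:
        # Hypergeometric probability that k of the events land in group 1.
        return Fraction(comb(events, k) * comb(total - events, n1 - k), denom)

    observed = prob(a)
    lowest = max(0, events - n2)
    highest = min(events, n1)
    # Sum every split of the events that is no more likely than the observed one.
    # Exact fractions avoid floating-point ties between symmetric outcomes.
    p = sum(prob(k) for k in range(lowest, highest + 1) if prob(k) <= observed)
    return float(p)


# Assumed group sizes: 2000 controls (4 deaths) and 2000 treated (2 deaths).
p_value = fisher_exact_two_sided(4, 2, 2000, 2000)
print(f"two-sided p-value: {p_value:.3f}")
```

Under these assumed group sizes the p-value comes out at about 0.69, nowhere near conventional significance. Doubling or tripling the number of patients while holding the six deaths fixed changes it very little, which is the sense in which the events, not the patients, set the effective sample size.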