Fewer than 10 events per variable (created 2009-02-18)


I am in the process of advising on the design of a study using logistic regression. There are five confounding variables and a treatment variable. If I apply the rule that you need 10 events per variable (EPV), then I need 60 events. I expect that the probability of observing an event is 40%. This means that I'll need data on 60 / 0.4 = 150 patients. I can only collect data on 90 patients, and that sample size gives me more than adequate power. Since my power will be fine, can I ignore the rule of thumb about 10 EPV?
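The sample-size arithmetic in the question can be sketched in a few lines. The function name and defaults below are just illustrative, not a standard routine:

```python
# Sketch of the EPV sample-size arithmetic from the question above.
def patients_needed(n_variables, events_per_variable=10, event_probability=0.4):
    """Minimum sample size so the expected event count meets the EPV rule."""
    events_needed = events_per_variable * n_variables
    return events_needed / event_probability

# Six variables (five confounders plus treatment) at 10 EPV need 60 events;
# at a 40% event rate, that implies 150 patients.
print(patients_needed(6))  # 150.0
```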

One quick caveat. If you have 5 potential confounders and 3 of them turn out to be non-significant, you have not improved your EPV ratio. The very act of screening or evaluating a covariate causes the problem, whether or not it survives the final cut in a stepwise procedure or anything similar. If you look at a variable at any point during the data analysis phase (trivial descriptive statistics aside), it counts towards your EPV ratio. The only two ways to improve your EPV ratio are to

  1. eliminate some confounders during the planning phase (before any data is collected).
  2. increase your sample size and thus your number of events.

Some people screen 5 potential confounders and report only the 2 they kept. If you pretend that you never looked at all 5, that is a form of fraud, because you are withholding information that readers need in order to critically evaluate your work.

I always tell people that no one gets thrown in jail for violating a rule of thumb. If you get data that has fewer than 10 EPV, it is probably still worth publishing, because failure to publish is a greater sin. Just be sure to mention the problems with overfitting in your discussion.

If you are in the planning stages, ask yourself whether collecting this data, with all its limitations, is worth the trouble. It's probably an oversimplification, but I tell people that if you have too little data for the complexity of the statistical model being considered, your results are likely to have problems during replication. Do you really want to collect data and make a decision when you know in your heart that the results are likely to be different when replicated? It seems rather anti-scientific, doesn't it?

There's a lot of pressure to do research, but I would argue that we need to do less research. We have too many researchers chasing too few patients. We should be more aggressive at investigating multi-center trials. A few research studies done well are a whole lot better than a lot of research studies done poorly.

There are some newer statistical models that overcome some of the problems with having too many confounding variables. In particular, propensity score models are definitely worth exploring. The propensity score creates a composite variable that emphasizes the covariates showing the greatest imbalance, which are therefore the most troublesome. It can get you out of some tight spots. Be careful, though, not to take the propensity score and toss it in as just another covariate; use it instead to stratify or match observations.
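A minimal sketch of the stratification approach, using simulated data for illustration. The propensity score is estimated here with a plain logistic regression fit by gradient descent simply to keep the sketch dependency-free; in practice you would use your usual modeling software:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 5))                          # five confounders
treat = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))  # treatment tied to X1

# Step 1: estimate the propensity score (probability of treatment given X)
# with a hand-rolled logistic regression, fit by gradient descent.
Xb = np.column_stack([np.ones(n), X])
w = np.zeros(Xb.shape[1])
for _ in range(2000):
    p = 1 / (1 + np.exp(-Xb @ w))
    w -= 0.1 * Xb.T @ (p - treat) / n
ps = 1 / (1 + np.exp(-Xb @ w))

# Step 2: stratify on the score (quintiles are a common default) rather
# than tossing the score in as just another covariate.
cuts = np.quantile(ps, [0.2, 0.4, 0.6, 0.8])
stratum = np.digitize(ps, cuts)                      # labels 0..4

# Within each stratum, treated and control patients should be roughly
# comparable on the confounders; check balance before estimating effects.
for s in range(5):
    mask = stratum == s
    print(s, mask.sum(), round(treat[mask].mean(), 2))
```

Within each quintile, the treated and control groups have similar propensity scores, so comparisons inside a stratum are less distorted by the confounders that drove treatment assignment.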

This leads to my final point. If these confounders are causing you problems, is there any way you can use matching in your design? Matching would avoid all of the problems of the EPV ratio, though it is tricky to implement in many situations.
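One common way to implement matching is 1:1 nearest-neighbor matching on the confounders themselves. The sketch below uses simulated data and plain Euclidean distance; the group sizes and distance choice are illustrative, not from the study above:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 5))             # five confounders
treat = np.repeat([1, 0], [10, 30])      # 10 treated, 30 controls

treated_idx = np.flatnonzero(treat == 1)
control_idx = np.flatnonzero(treat == 0)
available = set(control_idx.tolist())

pairs = []
for t in treated_idx:
    # Euclidean distance from this treated patient to each unused control.
    candidates = np.array(sorted(available))
    d = np.linalg.norm(X[candidates] - X[t], axis=1)
    best = candidates[np.argmin(d)]
    pairs.append((t, best))
    available.remove(best)               # match without replacement

print(pairs)
```

Each treated patient ends up paired with its closest remaining control, and the matched pairs can then be analyzed directly, without spending any EPV budget on the confounders.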