P.Mean: Minimum sample size needed for a time series prediction (created 2010-06-08)

Someone asked what minimum sample size is needed for a time series model to forecast future observations. Strictly speaking, you can forecast with two observations: draw a straight line connecting the two points and then extend that line as far as you want into the future. But you wouldn't want to do that. So a better question might be: what is the minimum number of data points that you would need in order to provide a good forecast of the future?
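To see how little the two-point approach gives you, here is a minimal Python sketch. The function name is my own invention, and it assumes the two observations are one time step apart. It will happily produce forecasts as far out as you like, but because a line through two points fits them perfectly, there is nothing left over to estimate how wrong those forecasts might be.

```python
def extrapolate_two_points(y_first, y_second, steps_ahead):
    """Extend the straight line through two equally spaced observations.

    Hypothetical helper for illustration only; it returns point forecasts
    and has no way to attach any measure of uncertainty to them.
    """
    slope = y_second - y_first  # change per time step
    return [y_second + slope * h for h in range(1, steps_ahead + 1)]


# Two observations, 10 and then 12, projected three steps ahead.
print(extrapolate_two_points(10.0, 12.0, 3))  # [14.0, 16.0, 18.0]
```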

Here you need to define "good". A simple and pragmatic definition is "sufficiently rigorous to withstand critical scrutiny, such as peer review." Standards of this kind are typically based on ad hoc and informal rules of thumb. Some of these have been cited already. Box and Jenkins, for example, recommended a minimum of 50 observations for an ARIMA model, and it is commonly expected that a model with seasonal effects would have to have several "seasons" worth of data. Meeting these minimums is necessary, but it may not be sufficient.
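These rules of thumb are easy to write down as a quick screening check. The sketch below is only that: the function name is hypothetical, the default of 50 observations is the Box and Jenkins figure, and the default of 3 seasons is my own stand-in for "several", so change it to whatever your reviewers would expect.

```python
def rule_of_thumb_problems(n_obs, season_length=None, min_obs=50, min_seasons=3):
    """List the informal sample size rules that a series fails.

    min_obs=50 follows the Box and Jenkins recommendation for ARIMA models.
    min_seasons=3 is an assumed value for "several seasons", not a standard.
    An empty list means the rules are met, not that the forecast is good.
    """
    problems = []
    if n_obs < min_obs:
        problems.append(f"only {n_obs} observations; rule of thumb asks for {min_obs}")
    if season_length is not None:
        seasons = n_obs / season_length
        if seasons < min_seasons:
            problems.append(
                f"only {seasons:.1f} seasons of data; want at least {min_seasons}"
            )
    return problems


# Thirty monthly observations fail both checks.
for problem in rule_of_thumb_problems(30, season_length=12):
    print(problem)
```

Passing this check only tells you that you have cleared the informal minimums. It says nothing about whether the resulting forecast is precise enough to be useful.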

A rigorous answer would say that a "good" prediction is one where the appropriate confidence limits are sufficiently narrow as to produce useful results. There's a story I tell about a researcher who gets a ten million dollar research grant. At the end of the grant, he writes up a report that says, "This is a new and innovative surgical procedure and we are 95% confident that the cure rate is somewhere between 3% and 98%."

Now, how narrow is "sufficiently narrow"? Well, that is in the eye of the beholder. Some intervals, such as the one cited above, are so wide as to be obviously inadequate. But other cases are harder to call.
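The cure rate in the story is a proportion rather than a time series forecast, but it is a convenient way to see how interval width responds to sample size. The sketch below uses the Wilson score interval, which is my choice for illustration and not anything specified in the story, and it assumes independent patients. It prints the 95% interval for an observed 50% cure rate at three sample sizes.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(successes, n, confidence=0.95):
    """Wilson score confidence interval for a binomial proportion."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Same observed cure rate (half the patients), increasing sample sizes.
for n in (4, 40, 400):
    low, high = wilson_interval(n // 2, n)
    print(f"n={n:3d}: {low:.2f} to {high:.2f} (width {high - low:.2f})")
```

The interval narrows as the sample grows, but whether any particular width is narrow enough still depends on what you plan to do with the answer.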

If you wanted to be obsessive about this, you would have to think about the economic value of your predictions and the economic costs of collecting the data, and then balance the two. It's bad to invest in an electron microscope to see objects that would be readily viewable in a less expensive optical microscope. It's equally bad to rely on an optical microscope when you need to see things much, much smaller than it is able to resolve. So sometimes a sample size is gross overkill and other times it is grossly inadequate. The economics of your particular situation will dictate which might be the case.

What about the case where the data are free--they've already been collected and they are sitting in front of you begging to be analyzed? There is still an economic cost, if only the cost of your labor to do all the data modeling. Is it worth five hours of your life to produce a confidence interval that goes from 3% to 98%? What would you be doing if you weren't analyzing this data set?

So an economic balancing act is always possible. First, make sure you meet some of the minimum sample sizes like the ones proposed by Box and Jenkins. Then make sure that the economic value of a narrow confidence interval justifies the cost of collecting and analyzing the data. Balancing the cost of sampling against the benefits of the prediction, however, is a task that is almost never done. It is much simpler to rely on simplistic rules of thumb. But even if you don't want to do the economic calculations yourself, present your predictions with appropriate intervals and let the readers make their own assessment.
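As a rough illustration of what presenting predictions with intervals might look like, here is a sketch built on the simplest forecast I can think of: a random walk with no drift, where every future value is predicted to equal the last observation. That model, the normal errors, and the function name are all assumptions made for the example. A real analysis would use whatever model suits the data, and these intervals ignore the uncertainty in the estimated standard deviation, so they will tend to be too narrow for short series.

```python
from math import sqrt
from statistics import NormalDist


def naive_forecast_with_interval(series, steps_ahead, confidence=0.95):
    """Random walk (naive) forecasts with approximate prediction intervals.

    Assumes one-step changes are independent, normal, and centered on zero.
    Returns a list of (forecast, lower, upper) tuples, one per step ahead.
    """
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    diffs = [b - a for a, b in zip(series, series[1:])]
    # Root mean square of the one-step changes estimates the error SD.
    sigma = sqrt(sum(d * d for d in diffs) / len(diffs))
    last = series[-1]
    results = []
    for h in range(1, steps_ahead + 1):
        margin = z * sigma * sqrt(h)  # uncertainty grows with the horizon
        results.append((last, last - margin, last + margin))
    return results


# Hypothetical data, made up purely to show the output format.
history = [20.0, 21.5, 19.8, 22.1, 23.0, 22.4]
for h, (point, low, high) in enumerate(naive_forecast_with_interval(history, 3), 1):
    print(f"{h} step(s) ahead: {point:.1f} ({low:.1f} to {high:.1f})")
```

Reporting the lower and upper limits alongside each forecast lets readers judge for themselves whether the interval is narrow enough for their purposes.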