P.Mean: A simple example of overfitting (created 2008-10-08).

A couple of the Internet discussion groups that I participate in have been discussing the concept of overfitting. Overfitting occurs when a model is too complex for a given sample size. I want to show a simple example of the negative consequences of overfitting.

In a previous page discussing segmented regression models, I used a data set showing firearm deaths in Australia per 100,000 people over a fifteen-year span (1983 to 1997). The following graph shows a linear regression fit (order 1 polynomial) to the data.

With only 15 data points, you should probably not fit any model more complex than this. But a quadratic fit (order 2 polynomial) to the data is not terrible (see below).
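If you want to try the linear and quadratic fits yourself, here is a minimal sketch in Python. The firearm death rates are not reproduced on this page, so the `rates` values below are made-up stand-ins (a declining trend plus noise), and the helper name `fit_rss` is my own invention for illustration. The only thing it has in common with the real data is the 15 annual observations from 1983 to 1997.

```python
import numpy as np
from numpy.polynomial import Polynomial

# Hypothetical stand-in data: NOT the actual Australian firearm death rates.
rng = np.random.default_rng(0)
years = np.arange(1983, 1998)  # 15 annual observations, 1983 through 1997
rates = 4.5 - 0.15 * (years - 1983) + rng.normal(0.0, 0.25, size=years.size)


def fit_rss(order):
    """Fit a polynomial of the given order and return it with its residual sum of squares."""
    # Polynomial.fit rescales the years internally, which keeps the fit well conditioned.
    poly = Polynomial.fit(years, rates, deg=order)
    residuals = rates - poly(years)
    return poly, float(np.sum(residuals ** 2))


linear, rss_linear = fit_rss(1)
quadratic, rss_quadratic = fit_rss(2)

print(f"order 1: residual sum of squares = {rss_linear:.4f}")
print(f"order 2: residual sum of squares = {rss_quadratic:.4f}")
```

The quadratic can never have a larger residual sum of squares than the line, because the line is a special case of the quadratic. That is exactly why a shrinking residual sum of squares, taken alone, is not evidence that the extra term was worth adding.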

In practice, there would be plenty of evidence that stopping at a quadratic model is a good idea, but let's pretend that we ignored those indicators and continued to add more terms to the polynomial. The cubic and quartic (order 3 and 4) polynomials seem scarcely different from the quadratic polynomial.
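I have not said which indicators I have in mind, so treat the following as just one possibility rather than the method used here: leave-one-out cross-validation. The sketch drops each year in turn, refits the polynomial to the remaining 14 points, and measures how badly the dropped year is predicted. As before, the `rates` values are made-up stand-ins for the real data, and `loocv_mse` is a hypothetical helper name.

```python
import numpy as np
from numpy.polynomial import Polynomial

# Hypothetical stand-in data: NOT the actual Australian firearm death rates.
rng = np.random.default_rng(0)
years = np.arange(1983, 1998)
rates = 4.5 - 0.15 * (years - 1983) + rng.normal(0.0, 0.25, size=years.size)


def loocv_mse(order):
    """Leave-one-out mean squared prediction error for a polynomial of this order."""
    squared_errors = []
    for i in range(years.size):
        keep = np.arange(years.size) != i  # every observation except the i-th
        poly = Polynomial.fit(years[keep], rates[keep], deg=order)
        squared_errors.append((rates[i] - poly(years[i])) ** 2)
    return float(np.mean(squared_errors))


cv = {order: loocv_mse(order) for order in range(1, 7)}
for order, mse in cv.items():
    print(f"order {order}: leave-one-out MSE = {mse:.4f}")
```

Unlike the residual sum of squares on the full data, this error is not forced to shrink as terms are added, so it can flag the point where extra terms stop helping.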

Fifth and sixth order polynomials are definitely uncalled for, but visually they look almost identical to the previous three polynomials.

Starting with a seventh order polynomial, things start to look bad.

The fitted line shows strange behavior near the first three or four data points, taking a seemingly needless dip between 1983 and 1984 and inserting a needless peak between 1985 and 1986. The order 8 polynomial fixes this anomaly but replaces it with a needless dip and peak at 1995 and 1996, respectively.

The ninth order polynomial has several peaks and dips near the beginning and the end.

Compared to the previous three polynomials, the tenth order polynomial looks rather tame, but it seems to be trying too hard to get close to all of the data points.

The eleventh order polynomial is terrible, with a huge dip between 1983 and 1984, and huge peaks between 1984 and 1985 and between 1996 and 1997. The sizes of these dips and peaks are comparable to half the range of the data set.

The twelfth order polynomial has a peak between 1996 and 1997 that is as large as the full range of the data.

The thirteenth order polynomial attenuates the 1996/1997 peak somewhat but at the price of reintroducing a peak between 1983 and 1984.

The fourteenth order polynomial is the largest one that can be fit: with 15 data points, its 15 coefficients use up every observation, so the curve passes exactly through each point. It zooms well outside the range of the data at several locations. Although you cannot see it on the scale of this graph, the fourteenth order polynomial produces predictions smaller than 0 and almost as large as 30.
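Here is a rough sketch of what happens at order 14, again using made-up stand-in values for `rates` rather than the real firearm data, so the numbers it prints will not match the graph described above. It checks that the residuals at the 15 observed years are essentially zero, then evaluates the curve on a fine grid between the years to see how far it strays from the data. The variable names (`saturated`, `grid`, `curve`) are my own.

```python
import numpy as np
from numpy.polynomial import Polynomial

# Hypothetical stand-in data: NOT the actual Australian firearm death rates.
rng = np.random.default_rng(0)
years = np.arange(1983, 1998)
rates = 4.5 - 0.15 * (years - 1983) + rng.normal(0.0, 0.25, size=years.size)

# Order 14 means 15 coefficients, one per observation, so the fit interpolates the data.
saturated = Polynomial.fit(years, rates, deg=14)
max_abs_resid = float(np.max(np.abs(rates - saturated(years))))

# Evaluate between the observed years, where nothing constrains the curve.
grid = np.linspace(years[0], years[-1], 1401)  # steps of 0.01 year
curve = saturated(grid)

print(f"largest residual at the data points: {max_abs_resid:.2e}")
print(f"range of the data:  {rates.min():.2f} to {rates.max():.2f}")
print(f"range of the curve: {curve.min():.2f} to {curve.max():.2f}")
```

A perfect fit at the observed years says nothing about the behavior between them, which is where the wild excursions show up.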

An interesting exercise would be to show these graphs in sequence to a group of students and see when they start feeling uncomfortable about the fitted polynomial.

Karl Wuensch has a nice term for overfitting, Bumblebee Regression, and has published an image to illustrate this.