Overfitting occurs when a statistical model is too complex for the amount of data that it is based on. The summary statistics on the data itself appear to be quite good, but the model will almost always produce poor predictions for new data. Here is an example of overfitting using data on hurricane frequencies from the Data and Story Library site.
Here’s what the data looks like in a graph. There is nothing too unusual about it. Now let’s try to forecast the number of hurricanes for the next decade.
Plot the data with a linear prediction
Here’s a linear trend. The prediction is 6.3904762 for the next decade.
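For readers who want to try this themselves, here is a minimal sketch of fitting a straight-line trend and extrapolating it one decade ahead. The data frame `hurricanes`, its column names `decade` and `count`, and the values in it are all placeholders invented for illustration. They are not the Data and Story Library values, so the number printed will not match the prediction quoted above.

```r
# Hypothetical placeholder data: decade start year and hurricane count.
# These values are invented for illustration; they are NOT the DASL data.
hurricanes <- data.frame(
  decade = seq(1900, 1990, by = 10),
  count  = c(6, 8, 5, 8, 10, 8, 6, 4, 5, 5)
)

# Fit a straight-line trend: count = b0 + b1 * decade
linear_fit <- lm(count ~ decade, data = hurricanes)

# Forecast the decade that follows the last observed one
next_decade <- data.frame(decade = max(hurricanes$decade) + 10)
linear_prediction <- predict(linear_fit, newdata = next_decade)
print(linear_prediction)
```

The key step is handing `predict()` a new data frame whose column name matches the predictor used in the model formula.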
Plot the data with a quadratic prediction
Here’s a quadratic trend. The prediction is 3.4527473. It’s quite a bit different.
Plot the data with a cubic prediction
Here’s a cubic (third order polynomial) trend. The prediction is now 1.4681319.
Plot the data with a 4th order polynomial
Here’s a quartic (fourth order polynomial) trend. The prediction is 3.8291708.
Plot the data with a 5th order polynomial
Here’s a fifth order polynomial trend. The prediction is 8.1412587.
Plot the data with a 6th order polynomial
Here’s where things get very weird. The sixth order polynomial produces a prediction of 24.1897436, which is more than twice as large as any previous value.
Plot the data with a 7th order polynomial
The weirdness continues with the seventh order polynomial trend, which produces a negative prediction (-0.5794872).
Plot the data with an 8th order polynomial
The eighth order polynomial trend also produces a negative prediction (-1.8747253).
Plot the data with a 9th order polynomial
Here’s a ninth order polynomial trend. The prediction is so extreme (-45.0805861) as to be ridiculous.
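The pattern across all nine fits can be tabulated in one pass. The sketch below assumes a hypothetical data frame `hurricanes` with invented values (not the actual hurricane counts) and uses `poly()` inside `lm()` to fit each order. It reports the in-sample R-squared next to the one-decade-ahead prediction, which is one way to see the fit to the existing data improve while the forecasts become less trustworthy.

```r
# Hypothetical placeholder data; values are invented, NOT the DASL data.
hurricanes <- data.frame(
  decade = seq(1900, 1990, by = 10),
  count  = c(6, 8, 5, 8, 10, 8, 6, 4, 5, 5)
)
next_decade <- data.frame(decade = max(hurricanes$decade) + 10)

# Total variation in the outcome, used to compute R-squared by hand
total_ss <- sum((hurricanes$count - mean(hurricanes$count))^2)

# Fit polynomial trends of order 1 through 9 and collect the results
results <- do.call(rbind, lapply(1:9, function(d) {
  fit <- lm(count ~ poly(decade, degree = d), data = hurricanes)
  data.frame(
    degree     = d,
    r_squared  = 1 - sum(resid(fit)^2) / total_ss,
    prediction = unname(predict(fit, newdata = next_decade))
  )
}))
print(results)
```

With ten data points, the ninth order polynomial has ten coefficients and passes through every point, so its R-squared is essentially 1 even though nothing has been learned about the next decade.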
This code reads data directly from the website and prints out the data in two pieces.
Code (plot data points)
Here’s the code for plotting the data points. I save the graph in an object so I can add trend lines later on.
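As a rough illustration of the idea, here is a minimal sketch that assumes the ggplot2 package and a hypothetical data frame `hurricanes` with invented values. The object name `base_plot` and the axis labels are also assumptions, and this is not necessarily how the original graph was drawn.

```r
library(ggplot2)

# Hypothetical placeholder data; values are invented, NOT the DASL data.
hurricanes <- data.frame(
  decade = seq(1900, 1990, by = 10),
  count  = c(6, 8, 5, 8, 10, 8, 6, 4, 5, 5)
)

# Store the scatterplot in an object so layers can be added to it later
base_plot <- ggplot(hurricanes, aes(x = decade, y = count)) +
  geom_point() +
  labs(x = "Decade", y = "Number of hurricanes")

# Printing the object draws the graph
print(base_plot)
```

Because a ggplot object is just a value, a later step can write `base_plot + geom_line(...)` without repeating the code for the points.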
Code that I used
Here’s the code for adding a trend line to the graph. I am setting up a flexible function that can print the trend from any polynomial.
Code that I used
It took a bit of work to put everything in a function, but now you can produce a quadratic or higher order trend rather than a linear trend with just a single line of code.
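One way such a function might look is sketched below. The helper name `add_poly_trend`, its arguments, and the data values are all hypothetical, and the sketch assumes ggplot2. It fits a polynomial of the requested order, predicts over a grid that runs one decade past the data, and adds the resulting curve to an existing plot.

```r
library(ggplot2)

# Hypothetical placeholder data; values are invented, NOT the DASL data.
hurricanes <- data.frame(
  decade = seq(1900, 1990, by = 10),
  count  = c(6, 8, 5, 8, 10, 8, 6, 4, 5, 5)
)
base_plot <- ggplot(hurricanes, aes(x = decade, y = count)) + geom_point()

# Hypothetical helper: add a polynomial trend of a given order to a plot
add_poly_trend <- function(plot, data, degree) {
  fit <- lm(count ~ poly(decade, degree = degree), data = data)
  # Grid of x values extending one decade beyond the observed data
  grid <- data.frame(
    decade = seq(min(data$decade), max(data$decade) + 10, length.out = 200)
  )
  grid$count <- predict(fit, newdata = grid)
  plot +
    geom_line(data = grid, color = "red") +
    labs(title = paste("Polynomial trend of order", degree))
}

# A quadratic trend now takes a single line
quadratic_plot <- add_poly_trend(base_plot, hurricanes, degree = 2)
print(quadratic_plot)
```

Changing `degree = 2` to any other order from 1 to 9 redraws the graph with that trend, which makes it easy to step through the whole sequence.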