name pclass age sex survived
1 Allen, Miss Elisabeth Walton 1st 29.00 female 1
2 Allison, Miss Helen Loraine 1st 2.00 female 0
3 Allison, Mr Hudson Joshua Creighton 1st 30.00 male 0
4 Allison, Mrs Hudson JC (Bessie Waldo Daniels) 1st 25.00 female 0
5 Allison, Master Hudson Trevor 1st 0.92 male 1
6 Anderson, Mr Harry 1st 47.00 male 1
Splines provide a useful way to model relationships that are more complex than a simple linear relationship. They work in a variety of regression models. Here is an illustration of how to use a spline in a logistic regression model with data from survival of passengers on the Titanic.
The Titanic was a large cruise ship, the biggest of its kind in 1912. It was thought to be unsinkable, but when it set sail from England to American in its maiden voyage, it struck an iceberg and sank, killing many of the passengers and crew. You can get fairly good data on the characteristics of passengers who died and compare them to those that survived. The data indicate a strong effect due to age and gender, representing a philosophy of “women and children first” that held during the boarding of life boats. Let’s look at the effect of age on survival using a logistic regression model.
Here is a brief description of the diamond pricing dataset, taken from the data dictionary on my github site.
The Titanic was a large cruise ship, the biggest of its kind in 1912. It was thought to be unsinkable, but when it set sail from England to America in its maiden voyage, it struck an iceberg and sank, killing many of the passengers and crew. You can get fairly good data on the characteristics of passengers who died and compare them to those that survived. The data indicate a strong effect due to age and gender, representing a philosophy of “women and children first” that held during the boarding of life boats.
Here are the first few rows of data.
Here are a few descriptive statistics
Length Class Mode
0 NULL NULL
pclass n
1 1st 322
2 2nd 280
3 3rd 711
sex n
1 female 462
2 male 851
survived n
1 0 863
2 1 450
The boxplots reveal little differences between the ages of survivors and deaths. If something is going on, it is subtle.
Warning: Removed 557 rows containing non-finite outside the scale range
(`stat_boxplot()`).
I did not share the code up to this point because it is routine.
Fit a linear model first.
<- filter(ti0, !is.na(age))
ti1 <- glm(
linear_fit ~ age,
survived family=binomial,
data=ti1)
tidy(linear_fit)
# A tibble: 2 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -0.0814 0.174 -0.468 0.640
2 age -0.00879 0.00523 -1.68 0.0928
There may be a downward trend in the odds of survival over time, but it is not statistically significant.
Plot the predicted values.
<- data.frame(
age_sequence age = seq(min(ti1$age), max(ti1$age), length=100))
$prediction <-
age_sequencepredict(linear_fit, newdata=age_sequence, type="response")
ggplot(data=ti1, aes(age, survived)) +
geom_point(pch=1) +
geom_line(
data=age_sequence,
aes(age, prediction))
The plot is linear, but you should really look at it on the log odds scale because the logistic regression model is linear in the log odds. You need to change the predict function from type=“response” to type=“link” to get predictions on a log odds scale.
$prediction <-
age_sequencepredict(linear_fit, newdata=age_sequence, type="link")
ggplot(data=age_sequence, aes(age, prediction)) +
geom_line()
Now fit a spline function.
<- glm(
spline_fit ~ rcs(age),
survived family=binomial,
data=ti1)
number of knots in rcs defaulting to 5
tidy(spline_fit)
# A tibble: 5 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 1.29 0.399 3.24 0.00120
2 rcs(age)age -0.102 0.0296 -3.42 0.000616
3 rcs(age)age' 0.223 0.159 1.40 0.161
4 rcs(age)age'' -0.141 1.20 -0.118 0.906
5 rcs(age)age''' -0.752 1.57 -0.478 0.633
The coefficients from the restricted cubic spline are pretty much uninterpretable. You have to visualize the spline graphically. First do this on the log odds scale to see how far from linear the spline fit is.
$prediction <-
age_sequencepredict(spline_fit, newdata=age_sequence, type="link")
ggplot(data=age_sequence, aes(age, prediction)) +
geom_line()
Now plot on the original scale to get a picture of what is going on.
$prediction <-
age_sequencepredict(spline_fit, newdata=age_sequence, type="response")
ggplot(data=ti1, aes(age, survived)) +
geom_point(pch=1) +
geom_line(
data=age_sequence,
aes(age, prediction))
It looks like “women and children” first might actually be “women, children, and old people first”.
An earlier version of this page was published on new.pmean.com.