Illustrating splines using the Titanic dataset

Splines provide a useful way to model relationships that are more complex than a simple linear relationship. They work in a variety of regression models. Here is an illustration of how to use a spline in a logistic regression model with data from survival of passengers on the Titanic.

The Titanic was a large cruise ship, the biggest of its kind in 1912. It was thought to be unsinkable, but when it set sail from England to American in its maiden voyage, it struck an iceberg and sank, killing many of the passengers and crew. You can get fairly good data on the characteristics of passengers who died and compare them to those that survived. The data indicate a strong effect due to age and gender, representing a philosophy of “women and children first” that held during the boarding of life boats. Let’s look at the effect of age on survival using a logistic regression model.

Here is a brief description of the diamond pricing dataset, taken from the data dictionary on my github site.

The Titanic was a large cruise ship, the biggest of its kind in 1912. It was thought to be unsinkable, but when it set sail from England to America in its maiden voyage, it struck an iceberg and sank, killing many of the passengers and crew. You can get fairly good data on the characteristics of passengers who died and compare them to those that survived. The data indicate a strong effect due to age and gender, representing a philosophy of “women and children first” that held during the boarding of life boats.

Here are the first few rows of data.

                                           name pclass   age    sex survived
1                  Allen, Miss Elisabeth Walton    1st 29.00 female        1
2                   Allison, Miss Helen Loraine    1st  2.00 female        0
3           Allison, Mr Hudson Joshua Creighton    1st 30.00   male        0
4 Allison, Mrs Hudson JC (Bessie Waldo Daniels)    1st 25.00 female        0
5                 Allison, Master Hudson Trevor    1st  0.92   male        1
6                            Anderson, Mr Harry    1st 47.00   male        1

Here are a few descriptive statistics

Length  Class   Mode 
     0   NULL   NULL

  pclass   n
1    1st 322
2    2nd 280
3    3rd 711

     sex   n
1 female 462
2   male 851

  survived   n
1        0 863
2        1 450

The boxplots reveal little differences between the ages of survivors and deaths. If something is going on, it is subtle.

Warning: Removed 557 rows containing non-finite outside the scale range
(`stat_boxplot()`).

I did not share the code up to this point because it is routine.

Fit a linear model first.

ti1 <- filter(ti0, !is.na(age))
linear_fit <- glm(
  survived ~ age,
  family=binomial,
  data=ti1)
tidy(linear_fit)

# A tibble: 2 × 5
  term        estimate std.error statistic p.value
  <chr>          <dbl>     <dbl>     <dbl>   <dbl>
1 (Intercept) -0.0814    0.174      -0.468  0.640 
2 age         -0.00879   0.00523    -1.68   0.0928

There may be a downward trend in the odds of survival over time, but it is not statistically significant.

Plot the predicted values.

age_sequence <- data.frame(
  age = seq(min(ti1$age), max(ti1$age), length=100))
age_sequence$prediction <- 
  predict(linear_fit, newdata=age_sequence, type="response")
ggplot(data=ti1, aes(age, survived)) +
  geom_point(pch=1) +
  geom_line(
    data=age_sequence,
    aes(age, prediction))

The plot is linear, but you should really look at it on the log odds scale because the logistic regression model is linear in the log odds. You need to change the predict function from type=“response” to type=“link” to get predictions on a log odds scale.

age_sequence$prediction <- 
  predict(linear_fit, newdata=age_sequence, type="link")
ggplot(data=age_sequence, aes(age, prediction)) +
  geom_line()

Now fit a spline function.

spline_fit <- glm(
  survived ~ rcs(age),
  family=binomial,
  data=ti1)

number of knots in rcs defaulting to 5

tidy(spline_fit)

# A tibble: 5 × 5
  term           estimate std.error statistic  p.value
  <chr>             <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)       1.29     0.399      3.24  0.00120 
2 rcs(age)age      -0.102    0.0296    -3.42  0.000616
3 rcs(age)age'      0.223    0.159      1.40  0.161   
4 rcs(age)age''    -0.141    1.20      -0.118 0.906   
5 rcs(age)age'''   -0.752    1.57      -0.478 0.633

The coefficients from the restricted cubic spline are pretty much uninterpretable. You have to visualize the spline graphically. First do this on the log odds scale to see how far from linear the spline fit is.

age_sequence$prediction <- 
  predict(spline_fit, newdata=age_sequence, type="link")
ggplot(data=age_sequence, aes(age, prediction)) +
  geom_line()

Now plot on the original scale to get a picture of what is going on.

age_sequence$prediction <- 
  predict(spline_fit, newdata=age_sequence, type="response")
ggplot(data=ti1, aes(age, survived)) +
  geom_point(pch=1) +
  geom_line(
    data=age_sequence,
    aes(age, prediction))

It looks like “women and children” first might actually be “women, children, and old people first”.

An earlier version of this page was published on new.pmean.com.