Illustrating imputation of missing data using the Titanic dataset

*Blog post
2024
Missing data
R programming
Using R code
Author

Steve Simon

Published

May 7, 2024

Multiple imputation is a way to properly account for missing values without causing bias. Simpler forms of imputation, such as replacing missing values with the mean of the non-missing values, can produce serious problems. You might think that ignoring the missingness and relying just on records with complete data for all key variables would be acceptable, but this also can produce serious problems. I want to illustrate a simple example of multiple imputation using data on mortality from the Titanic.

Here is a brief description of this dataset, taken from the data dictionary on my github site.

The Titanic was a large cruise ship, the biggest of its kind in 1912. It was thought to be unsinkable, but when it set sail from England to America in its maiden voyage, it struck an iceberg and sank, killing many of the passengers and crew. You can get fairly good data on the characteristics of passengers who died and compare them to those that survived. The data indicate a strong effect due to age and gender, representing a philosophy of “women and children first” that held during the boarding of life boats.

Here are the first few rows of data.

                                           Name PClass   Age    Sex Survived
1                  Allen, Miss Elisabeth Walton    1st 29.00 female        1
2                   Allison, Miss Helen Loraine    1st  2.00 female        0
3           Allison, Mr Hudson Joshua Creighton    1st 30.00   male        0
4 Allison, Mrs Hudson JC (Bessie Waldo Daniels)    1st 25.00 female        0
5                 Allison, Master Hudson Trevor    1st  0.92   male        1
6                            Anderson, Mr Harry    1st 47.00   male        1

I have hidden the R code up to this point, as it is mundane and not of great interest. I will show the R code and output for the rest of the analysis.

Notice the large number of missing values for age. The first three passengers with missing ages are

missing_rows <- which(is.na(ti0$Age))[1:3]
ti0 %>% slice(missing_rows)
                          Name PClass Age    Sex Survived
1 Aubert, Mrs Leontine Pauline    1st  NA female        1
2     Barkworth, Mr Algernon H    1st  NA   male        1
3           Baumann, Mr John D    1st  NA   male        0

Create multiply imputed values for age. The default is to use every variable in the dataset other than age to impute the value of age. You don’t want to use the name of the passenger, of course, so be sure to drop it before the imputation step.

ti1 <- ti0
ti1$i_female <- as.numeric(ti1$Sex=="female")
ti1 <- ti1[ , c("PClass", "Age", "Survived", "i_female")]
ti_mice <- mice(ti1)

 iter imp variable
  1   1  Age
  1   2  Age
  1   3  Age
  1   4  Age
  1   5  Age
  2   1  Age
  2   2  Age
  2   3  Age
  2   4  Age
  2   5  Age
  3   1  Age
  3   2  Age
  3   3  Age
  3   4  Age
  3   5  Age
  4   1  Age
  4   2  Age
  4   3  Age
  4   4  Age
  4   5  Age
  5   1  Age
  5   2  Age
  5   3  Age
  5   4  Age
  5   5  Age
Warning: Number of logged events: 1

The mice function creates a complex object. Go ahead and explore the various components, but be forewarned that this is messy. You can extract simple parts of the imputation with various functions. The complete function shows the complete datasets with the imputed values. Here are the first three rows where age was imputed the first time.

ti_mice %>%
  complete(1) %>%
  slice(missing_rows)
  PClass Age Survived i_female
1    1st  26        1        1
2    1st  21        1        0
3    1st  21        0        0

Look at the imputations for the second, third, etc. times. Notice that the values change with each imputation.

ti_mice %>%
  complete(2) %>%
  slice(missing_rows)
  PClass Age Survived i_female
1    1st  26        1        1
2    1st  28        1        0
3    1st  71        0        0
ti_mice %>%
  complete(3) %>%
  slice(missing_rows)
  PClass Age Survived i_female
1    1st   7        1        1
2    1st  32        1        0
3    1st  11        0        0
ti_mice %>%
  complete(4) %>%
  slice(missing_rows)
  PClass Age Survived i_female
1    1st   3        1        1
2    1st  65        1        0
3    1st   2        0        0
ti_mice %>%
  complete(5) %>%
  slice(missing_rows)
  PClass Age Survived i_female
1    1st  13        1        1
2    1st  20        1        0
3    1st  30        0        0

Use the with function of the mice package to fit a model to apply a particular analysis to each of the five imputed datasets.

ti_with <- with(
  ti_mice, 
  glm(
    Survived~PClass+rcs(Age, 3)+i_female, 
    family=binomial)) 

Again, the object created here is complex. You can extract the individual analyses fairly easily. Here is the analysis of the first imputed dataset.

ti_with$analyses[[1]]

Call:  glm(formula = Survived ~ PClass + rcs(Age, 3) + i_female, family = binomial)

Coefficients:
    (Intercept)        PClass2nd        PClass3rd   rcs(Age, 3)Age  
       0.925106        -1.104959        -2.426630        -0.038014  
rcs(Age, 3)Age'         i_female  
       0.001833         2.424994  

Degrees of Freedom: 1312 Total (i.e. Null);  1307 Residual
Null Deviance:      1688 
Residual Deviance: 1161     AIC: 1173

The key variable here is i_female, which shows how much different the log odds of survival are between men and women.

Notice that the estimate for sexmale changes from one analysis to another, though not by much.

ti_with$analyses[[2]]

Call:  glm(formula = Survived ~ PClass + rcs(Age, 3) + i_female, family = binomial)

Coefficients:
    (Intercept)        PClass2nd        PClass3rd   rcs(Age, 3)Age  
       0.098271        -0.977577        -2.266658        -0.008523  
rcs(Age, 3)Age'         i_female  
      -0.018944         2.336094  

Degrees of Freedom: 1312 Total (i.e. Null);  1307 Residual
Null Deviance:      1688 
Residual Deviance: 1180     AIC: 1192
ti_with$analyses[[3]]

Call:  glm(formula = Survived ~ PClass + rcs(Age, 3) + i_female, family = binomial)

Coefficients:
    (Intercept)        PClass2nd        PClass3rd   rcs(Age, 3)Age  
        0.23850         -0.84465         -2.14311         -0.03335  
rcs(Age, 3)Age'         i_female  
        0.03367          2.50795  

Degrees of Freedom: 1312 Total (i.e. Null);  1307 Residual
Null Deviance:      1688 
Residual Deviance: 1189     AIC: 1201
ti_with$analyses[[4]]

Call:  glm(formula = Survived ~ PClass + rcs(Age, 3) + i_female, family = binomial)

Coefficients:
    (Intercept)        PClass2nd        PClass3rd   rcs(Age, 3)Age  
        0.36634         -0.75314         -2.05923         -0.05359  
rcs(Age, 3)Age'         i_female  
        0.05074          2.45363  

Degrees of Freedom: 1312 Total (i.e. Null);  1307 Residual
Null Deviance:      1688 
Residual Deviance: 1180     AIC: 1192
ti_with$analyses[[5]]

Call:  glm(formula = Survived ~ PClass + rcs(Age, 3) + i_female, family = binomial)

Coefficients:
    (Intercept)        PClass2nd        PClass3rd   rcs(Age, 3)Age  
        0.20203         -0.89561         -2.19697         -0.02158  
rcs(Age, 3)Age'         i_female  
        0.01199          2.32473  

Degrees of Freedom: 1312 Total (i.e. Null);  1307 Residual
Null Deviance:      1688 
Residual Deviance: 1190     AIC: 1202

Combine all these analyses with the pool function.

ti_pool <- pool(ti_with)
ti_summary <- summary(ti_pool)
ti_summary
             term    estimate  std.error  statistic        df      p.value
1     (Intercept)  0.36604934 0.46148872  0.7931924 10.804840 4.447348e-01
2       PClass2nd -0.91518661 0.25080046 -3.6490627 33.051638 8.986452e-04
3       PClass3rd -2.21852089 0.24374033 -9.1019853 25.510460 1.721247e-09
4  rcs(Age, 3)Age -0.03101254 0.02195235 -1.4127211  7.523425 1.977431e-01
5 rcs(Age, 3)Age'  0.01585814 0.03363039  0.4715420  6.355279 6.530090e-01
6        i_female  2.40947851 0.17697951 13.6144487 67.957998 4.302603e-21

To complete things, compute the odds ratio and confidence interval.

female <- ti_summary$term=="i_female"
log_or <- ti_summary[female, "estimate"]
se <- ti_summary[female, "std.error"]
or <- round(exp(log_or), 1)
lo <- round(exp(log_or-1.96*se), 1)
hi <- round(exp(log_or+1.96*se), 1)
glue("Odds ratio for females is {or}, 95% CI ({lo},{hi})")
Odds ratio for females is 11.1, 95% CI (7.9,15.7)

An earlier version of this page was published on new.pmean.com.