Name PClass Age Sex Survived
1 Allen, Miss Elisabeth Walton 1st 29.00 female 1
2 Allison, Miss Helen Loraine 1st 2.00 female 0
3 Allison, Mr Hudson Joshua Creighton 1st 30.00 male 0
4 Allison, Mrs Hudson JC (Bessie Waldo Daniels) 1st 25.00 female 0
5 Allison, Master Hudson Trevor 1st 0.92 male 1
6 Anderson, Mr Harry 1st 47.00 male 1
Multiple imputation is a way to properly account for missing values without causing bias. Simpler forms of imputation, such as replacing missing values with the mean of the non-missing values, can produce serious problems. You might think that ignoring the missingness and relying just on records with complete data for all key variables would be acceptable, but this also can produce serious problems. I want to illustrate a simple example of multiple imputation using data on mortality from the Titanic.
Here is a brief description of this dataset, taken from the data dictionary on my github site.
The Titanic was a large cruise ship, the biggest of its kind in 1912. It was thought to be unsinkable, but when it set sail from England to America in its maiden voyage, it struck an iceberg and sank, killing many of the passengers and crew. You can get fairly good data on the characteristics of passengers who died and compare them to those that survived. The data indicate a strong effect due to age and gender, representing a philosophy of “women and children first” that held during the boarding of life boats.
Here are the first few rows of data.
I have hidden the R code up to this point, as it is mundane and not of great interest. I will show the R code and output for the rest of the analysis.
Notice the large number of missing values for age. The first three passengers with missing ages are
<- which(is.na(ti0$Age))[1:3]
missing_rows %>% slice(missing_rows) ti0
Name PClass Age Sex Survived
1 Aubert, Mrs Leontine Pauline 1st NA female 1
2 Barkworth, Mr Algernon H 1st NA male 1
3 Baumann, Mr John D 1st NA male 0
Create multiply imputed values for age. The default is to use every variable in the dataset other than age to impute the value of age. You don’t want to use the name of the passenger, of course, so be sure to drop it before the imputation step.
<- ti0
ti1 $i_female <- as.numeric(ti1$Sex=="female")
ti1<- ti1[ , c("PClass", "Age", "Survived", "i_female")]
ti1 <- mice(ti1) ti_mice
iter imp variable
1 1 Age
1 2 Age
1 3 Age
1 4 Age
1 5 Age
2 1 Age
2 2 Age
2 3 Age
2 4 Age
2 5 Age
3 1 Age
3 2 Age
3 3 Age
3 4 Age
3 5 Age
4 1 Age
4 2 Age
4 3 Age
4 4 Age
4 5 Age
5 1 Age
5 2 Age
5 3 Age
5 4 Age
5 5 Age
Warning: Number of logged events: 1
The mice function creates a complex object. Go ahead and explore the various components, but be forewarned that this is messy. You can extract simple parts of the imputation with various functions. The complete function shows the complete datasets with the imputed values. Here are the first three rows where age was imputed the first time.
%>%
ti_mice complete(1) %>%
slice(missing_rows)
PClass Age Survived i_female
1 1st 26 1 1
2 1st 21 1 0
3 1st 21 0 0
Look at the imputations for the second, third, etc. times. Notice that the values change with each imputation.
%>%
ti_mice complete(2) %>%
slice(missing_rows)
PClass Age Survived i_female
1 1st 26 1 1
2 1st 28 1 0
3 1st 71 0 0
%>%
ti_mice complete(3) %>%
slice(missing_rows)
PClass Age Survived i_female
1 1st 7 1 1
2 1st 32 1 0
3 1st 11 0 0
%>%
ti_mice complete(4) %>%
slice(missing_rows)
PClass Age Survived i_female
1 1st 3 1 1
2 1st 65 1 0
3 1st 2 0 0
%>%
ti_mice complete(5) %>%
slice(missing_rows)
PClass Age Survived i_female
1 1st 13 1 1
2 1st 20 1 0
3 1st 30 0 0
Use the with function of the mice package to fit a model to apply a particular analysis to each of the five imputed datasets.
<- with(
ti_with
ti_mice, glm(
~PClass+rcs(Age, 3)+i_female,
Survivedfamily=binomial))
Again, the object created here is complex. You can extract the individual analyses fairly easily. Here is the analysis of the first imputed dataset.
$analyses[[1]] ti_with
Call: glm(formula = Survived ~ PClass + rcs(Age, 3) + i_female, family = binomial)
Coefficients:
(Intercept) PClass2nd PClass3rd rcs(Age, 3)Age
0.925106 -1.104959 -2.426630 -0.038014
rcs(Age, 3)Age' i_female
0.001833 2.424994
Degrees of Freedom: 1312 Total (i.e. Null); 1307 Residual
Null Deviance: 1688
Residual Deviance: 1161 AIC: 1173
The key variable here is i_female, which shows how much different the log odds of survival are between men and women.
Notice that the estimate for sexmale changes from one analysis to another, though not by much.
$analyses[[2]] ti_with
Call: glm(formula = Survived ~ PClass + rcs(Age, 3) + i_female, family = binomial)
Coefficients:
(Intercept) PClass2nd PClass3rd rcs(Age, 3)Age
0.098271 -0.977577 -2.266658 -0.008523
rcs(Age, 3)Age' i_female
-0.018944 2.336094
Degrees of Freedom: 1312 Total (i.e. Null); 1307 Residual
Null Deviance: 1688
Residual Deviance: 1180 AIC: 1192
$analyses[[3]] ti_with
Call: glm(formula = Survived ~ PClass + rcs(Age, 3) + i_female, family = binomial)
Coefficients:
(Intercept) PClass2nd PClass3rd rcs(Age, 3)Age
0.23850 -0.84465 -2.14311 -0.03335
rcs(Age, 3)Age' i_female
0.03367 2.50795
Degrees of Freedom: 1312 Total (i.e. Null); 1307 Residual
Null Deviance: 1688
Residual Deviance: 1189 AIC: 1201
$analyses[[4]] ti_with
Call: glm(formula = Survived ~ PClass + rcs(Age, 3) + i_female, family = binomial)
Coefficients:
(Intercept) PClass2nd PClass3rd rcs(Age, 3)Age
0.36634 -0.75314 -2.05923 -0.05359
rcs(Age, 3)Age' i_female
0.05074 2.45363
Degrees of Freedom: 1312 Total (i.e. Null); 1307 Residual
Null Deviance: 1688
Residual Deviance: 1180 AIC: 1192
$analyses[[5]] ti_with
Call: glm(formula = Survived ~ PClass + rcs(Age, 3) + i_female, family = binomial)
Coefficients:
(Intercept) PClass2nd PClass3rd rcs(Age, 3)Age
0.20203 -0.89561 -2.19697 -0.02158
rcs(Age, 3)Age' i_female
0.01199 2.32473
Degrees of Freedom: 1312 Total (i.e. Null); 1307 Residual
Null Deviance: 1688
Residual Deviance: 1190 AIC: 1202
Combine all these analyses with the pool function.
<- pool(ti_with)
ti_pool <- summary(ti_pool)
ti_summary ti_summary
term estimate std.error statistic df p.value
1 (Intercept) 0.36604934 0.46148872 0.7931924 10.804840 4.447348e-01
2 PClass2nd -0.91518661 0.25080046 -3.6490627 33.051638 8.986452e-04
3 PClass3rd -2.21852089 0.24374033 -9.1019853 25.510460 1.721247e-09
4 rcs(Age, 3)Age -0.03101254 0.02195235 -1.4127211 7.523425 1.977431e-01
5 rcs(Age, 3)Age' 0.01585814 0.03363039 0.4715420 6.355279 6.530090e-01
6 i_female 2.40947851 0.17697951 13.6144487 67.957998 4.302603e-21
To complete things, compute the odds ratio and confidence interval.
<- ti_summary$term=="i_female"
female <- ti_summary[female, "estimate"]
log_or <- ti_summary[female, "std.error"]
se <- round(exp(log_or), 1)
or <- round(exp(log_or-1.96*se), 1)
lo <- round(exp(log_or+1.96*se), 1)
hi glue("Odds ratio for females is {or}, 95% CI ({lo},{hi})")
Odds ratio for females is 11.1, 95% CI (7.9,15.7)
An earlier version of this page was published on new.pmean.com.