I found a larger data set and wanted to show how you could use the Kaplan Meier curves as a preliminary screen of some categorical and continuous variables in a larger and more complex data set.
I am building on a previous blog entry and this referred to an earlier webpage. Here is the R code. The comment section at the top is a cut-and-paste from the documentation of the data file.
# Some simple examples of survival analysis
# The data set is a study of primary biliary cirrhosis
# Variables:
# V1 case number
# V2 number of days between registration and the earlier of death,
# transplantion
- or study analysis time in July
- 1986 # V3 status # 0=censored
- 1=censored due to liver tx
- 2=death # V4 drug # 1= D-penicillamine
- 2=placebo # V5 age in days # V6 sex # 0=male
- 1=female # V7 presence of asictes # 0=no 1=yes # V8 presence of hepatomegaly # 0=no 1=yes # V9 presence of spiders # 0=no 1=yes # V10 presence of edema
# 0 = no edema and no diuretic therapy for edema; # 0.5 = edema present without diuretics - or edema resolved by diuretics; # 1 = edema despite diuretic therapy # V11 serum bilirubin in mg/dl # V12 serum cholesterol in mg/dl # V13 albumin in gm/dl # V14 urine copper in ug/day# # V15 alkaline phosphatase in U/liter # V16 SGOT in U/ml # V17 triglicerides in mg/dl # V18 platelets per cubic ml / 1000 # V19 prothrombin time in seconds # V20 histologic stage of disease
Read the data in from the website. The first 60 rows are documentation about the file and the actual data starts on line 61.
fn <- "http://lib.stat.cmu.edu/datasets/pbc"
pbc <- read.table(fn,header=FALSE,skip=60)
head(pbc)
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20
1 1 400 2 1 21464 1 1 1 1 1.0 14.5 261 2.60 156 1718.0 137.95 172 190 12.2 4
2 2 4500 0 1 20617 1 0 1 1 0.0 1.1 302 4.14 54 7394.8 113.52 88 221 10.6 3
3 3 1012 2 1 25594 0 0 0 0 0.5 1.4 176 3.48 210 516.0 96.10 55 151 12.0 4
4 4 1925 2 1 19994 1 0 1 1 0.5 1.8 244 2.54 64 6121.8 60.63 92 183 10.3 4
5 5 1504 1 2 13918 1 0 1 1 0.0 3.4 279 3.53 143 671.0 113.15 72 136 10.9 3
6 6 2503 2 2 24201 1 0 1 0 0.0 0.8 248 3.98 50 944.0 93.00 63 . 11.0 3
The data looks like it read in properly - so let’s load the survival library.
> library("survival")
Loading required package: splines
Attaching package: <U+0091>survival<U+0092>
The following object is masked _by_ <U+0091>.GlobalEnv<U+0092>:
pbc
Now that’s strange. There is an object in the survival library that has the same name (pbc). Let’s investigate.
> rm(pbc)
> head(pbc)
id time status trt age sex ascites hepato spiders edema bili chol albumin copper alk.phos
1 1 400 2 1 58.76523 f 1 1 1 1.0 14.5 261 2.60 156 1718.0
2 2 4500 0 1 56.44627 f 0 1 1 0.0 1.1 302 4.14 54 7394.8
3 3 1012 2 1 70.07255 m 0 0 0 0.5 1.4 176 3.48 210 516.0
4 4 1925 2 1 54.74059 f 0 1 1 0.5 1.8 244 2.54 64 6121.8
5 5 1504 1 2 38.10541 f 0 1 1 0.0 3.4 279 3.53 143 671.0
6 6 2503 2 2 66.25873 f 0 1 0 0.0 0.8 248 3.98 50 944.0
ast trig platelet protime stage
1 137.95 172 190 12.2 4
2 113.52 88 221 10.6 3
3 96.10 55 151 12.0 4
4 60.63 92 183 10.3 4
5 113.15 72 136 10.9 3
6 93.00 63 NA 11.0 3
Wow! The data set I was importing was already part of the survival library. Let’s use their version - since it has better names for the variables than V1-V20.
We have to subset the data because the first 312 observations represent a randomized trial and the remaining observations are an add-on study of baseline values for some non-randomized patients.
> tail(pbc)
id time status trt age sex ascites hepato spiders edema bili chol albumin copper
413 413 989 0 NA 35.00068 f NA NA NA 0 0.7 NA 3.23 NA
414 414 681 2 NA 67.00068 f NA NA NA 0 1.2 NA 2.96 NA
415 415 1103 0 NA 39.00068 f NA NA NA 0 0.9 NA 3.83 NA
416 416 1055 0 NA 56.99932 f NA NA NA 0 1.6 NA 3.42 NA
417 417 691 0 NA 58.00137 f NA NA NA 0 0.8 NA 3.75 NA
418 418 976 0 NA 52.99932 f NA NA NA 0 0.7 NA 3.29 NA
alk.phos ast trig platelet protime stage
413 NA NA NA 312 10.8 3
414 NA NA NA 174 10.9 3
415 NA NA NA 180 11.2 4
416 NA NA NA 143 9.9 3
417 NA NA NA 269 10.4 3
418 NA NA NA 350 10.6 4
> pmod <- pbc[1:312,]
> tail(pmod)
id time status trt age sex ascites hepato spiders edema bili chol albumin copper
307 307 1149 0 2 30.57358 f 0 0 0 0 0.8 273 3.56 52
308 308 1153 0 1 61.18275 f 0 1 0 0 0.4 246 3.58 24
309 309 994 0 2 58.29979 f 0 0 0 0 0.4 260 2.75 41
310 310 939 0 1 62.33265 f 0 0 0 0 1.7 434 3.35 39
311 311 839 0 1 37.99863 f 0 0 0 0 2.0 247 3.16 69
312 312 788 0 2 33.15264 f 0 0 1 0 6.4 576 3.79 186
alk.phos ast trig platelet protime stage
307 1282 130 59 344 10.5 2
308 797 91 113 288 10.4 2
309 1166 70 82 231 10.8 2
310 1713 171 100 234 10.2 2
311 1050 117 88 335 10.5 2
312 2115 136 149 200 10.8 2
Take a look at the two key variables needed to define the survival object. Dividing by 365.25 gives the times in years rather than in days.
> summary(pmodtime)
Min. 1st Qu. Median Mean 3rd Qu. Max.
41 1191 1840 2006 2697 4556
> summary(pmodtime/365.25)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.1123 3.2610 5.0360 5.4930 7.3850 12.4700
> table(pmodstatus)
0 1 2
168 19 125
The values for status - if you read the documentation - are 0=censored, 1=transplant - and 2=death. You could define an event as either 1 and 2 or 2 by itself. Choose the latter for this example.
> p.surv >- Surv(pmodtime,pmod$status==2)
print(p.surv)
[1] 400 4500+ 1012 1925 1504+ 2503 1832+ 2466 2400 51 3762 304 3577+ 1217 3584
[16] 3672+ 769 131 4232+ 1356 3445+ 673 264 4079 4127+ 1444 77 549 4509+ 321
[31] 3839 4523+ 3170 3933+ 2847 3611+ 223 3244 2297 4467+ 1350 4453+ 4556+ 3428 4025+
[46] 2256 2576+ 4427+ 708 2598 3853 2386 1000 1434 1360 1847 3282 4459+ 2224 4365+
[61] 4256+ 3090 859 1487 3992+ 4191 2769 4039+ 1170 3458+ 4196+ 4184+ 4190+ 1827 1191
[76] 71 326 1690 3707+ 890 2540 3574 4050+ 4032+ 3358 1657 198 2452+ 1741 2689
[91] 460 388 3913+ 750 130 3850+ 611 3823+ 3820+ 552 3581+ 3099+ 110 3086 3092+
[106] 3222 3388+ 2583 2504+ 2105 2350+ 3445 980 3395 3422+ 3336+ 1083 2288 515 2033+
[121] 191 3297+ 971 3069+ 2468+ 824 3255+ 1037 3239+ 1413 850 2944+ 2796 3149+ 3150+
[136] 3098+ 2990+ 1297 2106+ 3059+ 3050+ 2419 786 943 2976+ 2615+ 2995+ 1427 762 2891+
[151] 2870+ 1152 2863+ 140 2666+ 853 2835+ 2475+ 1536 2772+ 2797+ 186 2055 264 1077
[166] 2721+ 1682 2713+ 1212 2692+ 2574+ 2301+ 2657+ 2644+ 2624+ 1492 2609+ 2580+ 2573+ 2563+
[181] 2556+ 2555+ 2241+ 974 2527+ 1576 733 2332+ 2456+ 2504+ 216 2443+ 797 2449+ 2330+
[196] 2363+ 2365+ 2357+ 1592+ 2318+ 2294+ 2272+ 2221+ 2090 2081 2255+ 2171+ 904 2216+ 2224+
[211] 2195+ 2176+ 2178+ 1786 1080 2168+ 790 2170+ 2157+ 1235 2050+ 597 334 1945+ 2022+
[226] 1978+ 999 1967+ 348 1979+ 1165 1951+ 1932+ 1776+ 1882+ 1908+ 1882+ 1874+ 694 1831+
[241] 837+ 1810+ 930 1690 1790+ 1435+ 732+ 1785+ 1783+ 1769+ 1457+ 1770+ 1765+ 737+ 1735+
[256] 1701+ 1614+ 1702+ 1615+ 1656+ 1677+ 1666+ 1301+ 1542+ 1084+ 1614+ 179 1191 1363+ 1568+
[271] 1569+ 1525+ 1558+ 1447+ 1349+ 1481+ 1434+ 1420+ 1433+ 1412+ 41 1455+ 1030+ 1418+ 1401+
[286] 1408+ 1234+ 1067+ 799 1363+ 901+ 1329+ 1320+ 1302+ 877+ 1321+ 533+ 1300+ 1293+ 207
[301] 1295+ 1271+ 1250+ 1230+ 1216+ 1216+ 1149+ 1153+ 994+ 939+ 839+ 788+
First - let’s look at an overall survival curve.
> plot(survfit(p.surv~1))
Now let’s define some functions that will calculate and compare Kaplan-Meier curves across all the possible covariates in the model. For categorical variables - draw Kaplan-Meier curves for each category level. For continuous variables - split the data into quartiles and then draw Kaplan-Meier curves for each quartile.
km.cat <- function(vn) {
print(vn)
print(table(pmod[,vn]))
km.all(pmod[,vn])
title(vn)
}
km.con <- function(vn) {
print(vn)
br <- quantile(pmod[,vn],na.rm=TRUE)
br[1] <- 0
quartiles <- cut(pmod[,vn],breaks=br)
print(table(quartiles))
km.all(quartiles)
title(vn)
}
km.all <- function(x) {
tb <- table(x)
mi <- sum(is.na(x))
sf <- survfit(p.surv~x)
print(paste("There are",mi,"missing values."))
plot(sf,xlim=c(0,6000))
end.pts <- cumsum(sfstrata)
end.x <- sftime[end.pts]+100
end.y <- sfsurv[end.pts]
end.nm <- names(tb)
text(end.x,end.y,end.nm,cex=1.5,col="darkred",adj=0)
}
v.cat <- names(pmod)[c(4,6:10,20)]
v.con <- names(pmod)[c(5,11:19)]
for (v in v.cat) {
km.cat(v)
}
for (v in v.con) {
km.con(v)
}
Here is the output.
<- names(pmod)[c(4,6:10,20)]
v.cat <- names(pmod)[c(5,11:19)]
v.con for (v in v.cat) {
km.cat(v)
}
## [1] "trt"
##
## 1 2
## 158 154
## [1] "There are 0 missing values."
## [1] "sex"
##
## m f
## 36 276
## [1] "There are 0 missing values."
## [1] "ascites"
##
## 0 1
## 288 24
## [1] "There are 0 missing values."
## [1] "hepato"
##
## 0 1
## 152 160
## [1] "There are 0 missing values."
## [1] "spiders"
##
## 0 1
## 222 90
## [1] "There are 0 missing values."
## [1] "edema"
##
## 0 0.5 1
## 263 29 20
## [1] "There are 0 missing values."
## [1] "stage"
##
## 1 2 3 4
## 16 67 120 109
## [1] "There are 0 missing values."
for (v in v.con) {
km.con(v)
}
## [1] "age"
## quartiles
## (0,42.2] (42.2,49.8] (49.8,56.7] (56.7,78.4]
## 78 78 78 78
## [1] "There are 0 missing values."
## [1] "bili"
## quartiles
## (0,0.8] (0.8,1.35] (1.35,3.42] (3.42,28]
## 90 66 78 78
## [1] "There are 0 missing values."
## [1] "chol"
## quartiles
## (0,250] (250,310] (310,400] (400,1.78e+03]
## 71 71 72 70
## [1] "There are 28 missing values."
## [1] "albumin"
## quartiles
## (0,3.31] (3.31,3.55] (3.55,3.8] (3.8,4.64]
## 81 76 81 74
## [1] "There are 0 missing values."
## [1] "copper"
## quartiles
## (0,41.2] (41.2,73] (73,123] (123,588]
## 78 80 75 77
## [1] "There are 2 missing values."
## [1] "alk.phos"
## quartiles
## (0,872] (872,1.26e+03] (1.26e+03,1.98e+03]
## 78 78 78
## (1.98e+03,1.39e+04]
## 78
## [1] "There are 0 missing values."
## [1] "ast"
## quartiles
## (0,80.6] (80.6,115] (115,152] (152,457]
## 79 78 79 76
## [1] "There are 0 missing values."
## [1] "trig"
## quartiles
## (0,84.2] (84.2,108] (108,151] (151,598]
## 71 71 70 70
## [1] "There are 30 missing values."
## [1] "platelet"
## quartiles
## (0,200] (200,257] (257,322] (322,563]
## 77 77 77 77
## [1] "There are 4 missing values."
## [1] "protime"
## quartiles
## (0,10] (10,10.6] (10.6,11.1] (11.1,17.1]
## 89 85 61 77
## [1] "There are 0 missing values."