P.Mean: Lessons learned the hard way: don't presume to know how your software handles missing value codes (created 2010-05-28)

This page is moving to a new website.

I'm working on an interesting project that involves summing up costs across certain records for a given patient. Some of the costs are missing. How should the program handle these missing costs. We discussed this by email and agreed to ignore missing costs in the sum. This is effectively the same as replacing the missing costs with zero. There is two cases worth worrying about, though, and handling those cases makes me realize just how tricky missing values are.

Case #1 is when a patient has no records that meet the qualification. Case #2 is when all of the qualifying records have missing values for costs. I want to create different codes for these two cases. I'm using R, but the trickiness would occur with just about any statististical software. And the problem, unfortunately, is that there is no consistency in how other programs handle missing values. So don't look too literally at the R code if you use SAS or SPSS, but rather take it as a warning that you need to understand what your computer is doing.

One of the things I need to compute is the total expenditures in a given time from for acupuncture treatment. The code for this is

acp <- (locserv %in% 2:3) & (provgroup=="2") & (provtype %in% c("ACP","BKA"))

Now, the way to compute a sum for each patient is to use the "sum" function within "tapply".

acpcosts <- tapply(costs[acp==1],id[acp==1],sum,na.rm=TRUE)

In SAS, you'd use PROC MEANS and a BY statement and in SPSS, you would use AGGREGATE. So what happens if a given patient does not have any records? In most software, you would just skip that patient and pretend they never existed. And that it what R does. There are 34,455 unique patients in the file, but only 502 actually had an accupuncture treatment during the appropriate time frame. So the remaining 33,953 patients are ignored.

How does R handle missing values? Well, I told R that I wanted it to discard any missing values by including the

na.rm=TRUE

option. If you told R to compute

sum(c(1,2,3,NA))

it would produce the value of NA (NA is the missing value code in R). This is logical: if one of the items in a sum is unknown, then the sum must be unknown. To avoid this, you would write something like

sum(c(1,2,3,NA),na.rm=TRUE)

which would produce the value of 6. But how would this work for

sum(c(NA,NA,NA),na.rm=TRUE)

The answer is not immediately clear. You remove the NAs and there is nothing left. Should the sum of an empty vector be zero? Or should it be missing? One hint as to how R would handle this comes from trying the statement

sum(NULL)

which produces a value of zero (NULL is the R term for an empty vector). And when you test

sum(c(NA,NA,NA),na.rm=TRUE)

directly, it does produce zero. To be safe, I also tested

sum(NA,na.rm=TRUE)

and

sum(c(NA,NA),na.rm=TRUE)

and they also produced values of zero. Among the 502 patients with acupuncture, exactly 2 had a sum of zero. Since there are no zero or negative values for costs (I checked), this could only happen if all the costs are missing.

Now I want to code the data as 0 for the 33,953 patients with no acupuncture visits and -1 for the 2 patients with one or more acupuncture visits but all the costs are missing. When you merge the accupuncture sums with the master file, it will create a new file with only 502 records. Why? Because the default in a merge is to only include records if they are found in both files. You can overcome this by using

merge(master,acpcosts,by=id,all=TRUE)

and this will include a row in the new data set if you can find a record in either file. It will end up producing missing values in appropriate locations to fill in the data set properly.

But I don't want NAs for the 33,953 patients. I want them to have zeros. So the logical thing to do is to change the 2 zeros to -1 and the 33,953 NAs to zero. Here's some code that looks like it would work.

newdata <- merge(master,acpcosts,by=id,all=TRUE)
newdata$acp[newdata$acpcosts==0] <- -1
newdata$acp[newdata$acpcosts==NA] <- 0

It doesn't work, though, because logic statements involving missing values are tricky. Here's a simple example. Type the following code into R.

x <- c(1,2,3,NA)
x==2

You would think that the second statement would produce the result

FALSE TRUE FALSE FALSE

but it actually produces the result

FALSE TRUE FALSE NA

If we don't know what the fourth value of a vector is, we can't assign a value of either TRUE or FALSE to almost all logical statements involving that value (email me if you know what some of the exceptions might be and you might win a special prize). What is a bit surprising is that NA==NA evaluates to NA. If you don't know what the left side is and you don't know what the right side is, you can't say if they are equal, can you? So

x <- c(1,2,3,NA)
x==NA

produces the unhelpful result

NA NA NA NA

So how can you find and convert all the missing values to zero? R includes the function is.na for this. So

x <- c(1,2,3,NA)
is.na(x)

produces

FALSE FALSE FALSE TRUE

There's one more potential problem, though. If we try

newdata <- merge(master,acpcosts,by=id,all=TRUE)
newdata$acp[newdata$acpcosts==0] <- -1
newdata$acp[is.na(newdata$acpcosts)] <- 0

how will the program handle NA values in the second statement. Will they be ignored? Will they be converted? The safer thing to do is to convert the NA values first, then convert the zeros.

newdata <- merge(master,acpcosts,by=id,all=TRUE)
newdata$acp[is.na(newdata$acpcosts)] <- -2
newdata$acp[newdata$acpcosts==0] <- -1
newdata$acp[newdata$acpcosts==-2] <- 0

Of course, we have to convert to a temporary value (-2) so as to not mix up the 33,953 NA values destined to be zero with the 2 zero values destined to be -1.

Now you might be tempted to ask why R doesn't process a statement like

x==NA

in a "logical" way so that it is TRUE for missing values and FALSE for nonmissing values? That's what SAS does. SAS lets the missing value code be an impossibly big negative number. This helps in some settings but hurts in others. If you are interested in selecting all the cases where x<2, SAS will select values of 1, 1.9, -93, and so forth, but it will also select the missing value. So if you wanted to create a new variable that is equal to 1 if x<2 and equal to 0 if x>=2 and equal to missing if x is missing, you have to work harder in SAS to get this than in R.

The bottom line is that missing values will always produce ambiguous situations (you can show this mathematically), and every statistical software program will handle missing values in a slightly different way. There is no standard operating procedure for missing values. You can force logic statements involving missing values to be either TRUE or FALSE like SAS does, or you can allow a logic statement to produce missing values like R does. Either way, you will end up with tricky situations where the program behaves in a way you wouldn't normally expect. Eventually, you'll figure things out, but it takes a lot of trial and error testing.

P.S. All this work to properly account for 2 records out of 34,455? Yes, but that's part of my job. Also, it pops up not just with acupuncture but with several other procedures as well. I'm detail oriented, and that has helped me well in my career as a statistician.

P.P.S. I'm giving a talk based on this webpage for the Kansas City R Users Group, and everyone was asked to bring a favorite graph. Here's a favorite of mine showing various gradients of color.

ap <- (0:100)/100
aq <- rev(ap)

a0 <- rep(0,101)
a1 <- rep(1,101)
p
lot(0:1,0:1,type="n")
points(
ap,a0,col=rgb( 0,ap, 0)) # bottom: black to green
points(a1,ap,col=rgb(ap, 1, 0)) # right: green to yellow
points(aq,a1,col=rgb( 1, 1,ap)) # top: yellow to white
points(a0,aq,col=rgb(aq,aq,aq)) # left: white to black
points(ap,ap,col=rgb(ap,ap, 0))
# diagonal: black to yellow
points(ap,aq,col=rgb(aq, 1,aq)) # diagonal: white to green

Illustration of color gradients