P.Mean: Entropy as a measure of data quality across multiple variables (created 2008-08-25)

In a previous webpage, I discussed the use of cumulative entropy as a measure of data quality. A sudden shift in cumulative entropy that is not associated with a significant change in the research design is a possible marker for a data quality issue. The advantage of entropy is that it can be used for very large data sets where a context-specific analysis of data quality is difficult or impractical.

Entropy can also be cumulated across multiple columns of data, to look for global shifts. How do you define entropy across multiple variables? I assume, for simplicity, that all the variables are categorical. An extension of the definition of entropy for continuous variables is possible, but I have not fully explored this yet.

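The entropy for a single column of data involves listing the possible values for that column of data and the associated probabilities. Suppose that there are k possible values, with probabilities p₁, p₂, ..., p_k. Then entropy is defined as

$$H = -\sum_{i=1}^{k} p_i \log_2 p_i$$

where the logarithm is taken to base 2, the base that reproduces the entropy values reported in the race and sex example later on this page.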

Any category with zero probability should be discarded before calculating entropy.
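As a minimal sketch of this definition in R, the function below computes the entropy of one categorical column. The name `entropy` is a hypothetical helper chosen for illustration, not something from the original analysis, and base 2 logarithms are assumed.

```r
# Shannon entropy (in bits) of one categorical column.
# `entropy` is a hypothetical helper name used only for illustration.
entropy <- function(x) {
  p <- prop.table(table(x))   # observed proportion of each category
  p <- p[p > 0]               # discard categories with zero probability
  -sum(p * log2(p))
}

entropy(c("a", "a", "b", "b"))   # two equally likely values: 1 bit
entropy(rep("a", 10))            # a constant column: 0 bits

# A factor level that never occurs would give 0 * log2(0) = NaN
# if it were not discarded first.
entropy(factor(c("a", "b"), levels = c("a", "b", "c")))   # 1 bit
```

Dropping the empty categories matters in practice for factors, where unused levels still show up in the frequency table.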

For multiple columns of data, you can define the joint entropy by creating categories equal to all observed combinations of categories in the data set. Suppose, for example, that you have two variables, race (White, Black, Hispanic, Other) and sex (Male or Female). The joint entropy would simply be the entropy of a variable with eight possible values (White Male, Black Male, Hispanic Male, Other Male, White Female, Black Female, Hispanic Female, Other Female). You could also define the marginal entropy which would simply be the sum of the entropy for race and the entropy for sex.

It is unclear to me whether joint or marginal entropy works better for detecting data quality issues. The two values will be exactly equal if the columns of data are exactly independent. The discrepancy increases as the association among the columns of data becomes stronger.
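The relationship between the two quantities can be written compactly. This is the standard subadditivity property of Shannon entropy, stated here for two columns; the symbols X and Y are introduced only for this illustration, and "marginal entropy" is used in this page's sense of the sum of the single-column entropies.

```latex
H(X, Y) \le H(X) + H(Y),
\qquad
H(X) + H(Y) - H(X, Y) = I(X; Y) \ge 0
```

The gap I(X; Y) is the mutual information. It is zero exactly when the two columns are independent, so the joint entropy can never exceed the marginal entropy.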

Here's an example of calculating joint and marginal entropy:

      race       sex
 [1,] "White"    "Male"
 [2,] "White"    "Male"
 [3,] "Other"    "Male"
 [4,] "White"    "Male"
 [5,] "Hispanic" "Male"
 [6,] "White"    "Male"
 [7,] "Black"    "Male"
 [8,] "Black"    "Male"
 [9,] "White"    "Male"
[10,] "Other"    "Male"
[11,] "White"    "Male"
[12,] "Black"    "Male"
[13,] "Black"    "Male"
[14,] "Other"    "Male"
[15,] "Hispanic" "Female"
[16,] "Hispanic" "Male"
[17,] "White"    "Male"
[18,] "White"    "Male"
[19,] "White"    "Male"
[20,] "White"    "Male"
[21,] "Other"    "Male"
[22,] "Hispanic" "Female"
[23,] "White"    "Male"
[24,] "Black"    "Male"
[25,] "Hispanic" "Female"
[26,] "White"    "Male"
[27,] "White"    "Male"
[28,] "White"    "Male"
[29,] "White"    "Male"
[30,] "Hispanic" "Female"

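The probabilities for race (White, Black, Hispanic, Other) are 50%, 17%, 20%, and 13% and the probabilities for sex (Male, Female) are 87% and 13%. The probabilities for the combined values (White Male, White Female, Black Male, Black Female, Hispanic Male, Hispanic Female, Other Male, Other Female) are 50%, 0%, 17%, 0%, 7%, 13%, 13% and 0%. The entropy for sex is 0.57, the entropy for race is 1.78. The joint entropy is 1.97 and the marginal entropy is 2.35.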
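These figures can be checked with a few lines of R. The sketch below retypes the thirty rows listed above and assumes base 2 logarithms; `entropy` and `result` are hypothetical names chosen for illustration.

```r
# Hypothetical helper: Shannon entropy (in bits) of a categorical vector.
entropy <- function(x) {
  p <- prop.table(table(x))
  p <- p[p > 0]               # discard zero-probability categories
  -sum(p * log2(p))
}

# The thirty rows of the example, in order.
race <- c("White", "White", "Other", "White", "Hispanic", "White",
          "Black", "Black", "White", "Other", "White", "Black",
          "Black", "Other", "Hispanic", "Hispanic", "White", "White",
          "White", "White", "Other", "Hispanic", "White", "Black",
          "Hispanic", "White", "White", "White", "White", "Hispanic")
sex <- rep("Male", 30)
sex[c(15, 22, 25, 30)] <- "Female"

result <- c(sex      = entropy(sex),
            race     = entropy(race),
            joint    = entropy(paste(race, sex)),   # one label per observed combination
            marginal = entropy(race) + entropy(sex))
round(result, 2)   # sex 0.57, race 1.78, joint 1.97, marginal 2.35
```

Pasting the columns together row by row creates exactly the observed combinations, so combinations that never occur (such as White Female here) are left out automatically.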

Here's a very simple example. Suppose there was a problem in data translation where one column of data was suddenly shifted unexpectedly. We have 10,000 rows and six columns of data. The first column represents the age of the patient, with values of 1 through 10. The remaining columns of data represent Likert scale items, with the earlier columns having a preponderance of 1s and 2s and the later columns having a preponderance of 4s and 5s. Here are the first twenty rows of data.

 [1,]  6 3 3 4 5 5
 [2,]  4 1 1 2 3 3
 [3,]  8 1 2 3 4 5
 [4,]  7 1 2 2 4 4
 [5,]  9 1 2 3 4 5
 [6,]  2 3 3 5 5 5
 [7,]  9 1 1 3 3 4
 [8,] 10 1 2 1 3 4
 [9,]  8 2 4 4 5 5
[10,]  3 2 3 4 5 4
[11,]  6 4 4 4 4 5
[12,] 10 1 1 3 4 3
[13,]  1 1 2 2 4 4
[14,]  4 1 3 4 5 5
[15,]  5 1 2 3 3 3
[16,]  5 1 3 4 5 5
[17,] 10 3 2 3 4 5
[18,]  5 1 2 4 3 4
[19,]  1 2 1 3 4 5
[20,] 10 2 2 3 4 5

I generated the Likert scale items by slicing a multivariate normal distribution.

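library(mvtnorm)   # provides rmvnorm()

# Five correlated normal variables: unit variances, correlations of 0.5, means 0 through 4
sigma <- matrix(0.5,nrow=5,ncol=5)
diag(sigma) <- rep(1,5)
mu <- 0:4
mvn <- rmvnorm(10000,mu,sigma)
# Slice each variable at 0, 1, 2, and 3 to get Likert codes 1 through 5
qmat <- matrix(cut(mvn,c(-99,0:3,99),labels=FALSE),nrow=10000)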
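To see why slicing at 0, 1, 2, and 3 piles the early items into the low codes and the late items into the high codes, the sketch below works out the category probabilities directly from the normal distribution. It uses base R only; the names `breaks` and `probs` are illustrative, and it looks at each item on its own, ignoring the correlation between items.

```r
breaks <- c(-99, 0:3, 99)   # the cut points used in the simulation

# cut() with labels = FALSE returns the interval number, 1 through 5.
cut(c(-0.5, 0.5, 1.5, 2.5, 3.5), breaks, labels = FALSE)   # 1 2 3 4 5

# Probability of each response for an item with mean m and variance 1.
# Rows are the five items (means 0 to 4), columns are responses 1 to 5.
probs <- t(sapply(0:4, function(m) diff(pnorm(breaks, mean = m, sd = 1))))
dimnames(probs) <- list(paste0("mean", 0:4), paste0("resp", 1:5))
round(probs, 2)
```

The item with mean 0 falls at or below zero half the time, so half of its responses are coded 1, while the item with mean 4 exceeds 3 most of the time and is coded 5.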

I generated the ages by shuffling (using the sample function).

age <- sample(rep(1:10,rep(1000,10)))

I simulated a shift in columns starting at row 8001 with the following code:

qmatx <- qmat
qmatx[8001:10000,1:4] <- qmat[8001:10000,2:5]
qmatx[8001:10000,5] <- 9
agex <- age
agex[8001:10000] <- qmat[8001:10000,1]

The code effectively replaces the ages with the first Likert scale, shifts the remaining four Likert scales to the left, and adds a missing-value code (9) for the fifth Likert scale.
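The effect is easier to see on a tiny example. The sketch below applies the same assignments to a made-up four-row data set in which Likert column j always holds the value j; the toy values and the name `shift_rows` are assumptions for illustration only.

```r
# Toy stand-ins for `qmat` and `age`: four rows instead of 10,000.
qmat <- matrix(rep(1:5, each = 4), nrow = 4)   # column j holds the value j
age  <- c(7, 3, 8, 2)
shift_rows <- 3:4                              # rows affected by the shift

qmatx <- qmat
qmatx[shift_rows, 1:4] <- qmat[shift_rows, 2:5]   # Likert items move one column left
qmatx[shift_rows, 5] <- 9                         # last item filled with the missing code
agex <- age
agex[shift_rows] <- qmat[shift_rows, 1]           # age overwritten by the first item

cbind(agex, qmatx)
# rows 1-2 are unchanged: 7 1 2 3 4 5 and 3 1 2 3 4 5
# rows 3-4 are shifted:   1 2 3 4 5 9
```

Every value in the shifted rows lands one column to the left of where it belongs, which is the kind of error that can slip through a file conversion unnoticed.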

Here is a plot of the cumulative joint entropy.

The scaling here needs to be readjusted, but there is an obvious downward shift in entropy at 8000.

The marginal entropy shows a shift in the opposite direction.

Here the shift is obvious even without rescaling.
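The calculation behind these plots is not shown on this page. One way it might be done in R is sketched below, where the entropy is recomputed on rows 1 through i for a grid of values of i. The helper names `entropy`, `cum_marginal`, and `cum_joint` are hypothetical, and the toy data is a small stand-in with a deliberate break at row 801, not the simulation described above.

```r
# Hypothetical helper: Shannon entropy (in bits) of a categorical vector.
entropy <- function(x) {
  p <- prop.table(table(x))
  p <- p[p > 0]
  -sum(p * log2(p))
}

# Cumulative marginal entropy: sum of the column entropies of rows 1..i.
cum_marginal <- function(m, at) {
  sapply(at, function(i) sum(apply(m[1:i, , drop = FALSE], 2, entropy)))
}

# Cumulative joint entropy: entropy of the row-wise combinations of rows 1..i.
cum_joint <- function(m, at) {
  key <- apply(m, 1, paste, collapse = "|")   # one label per row
  sapply(at, function(i) entropy(key[1:i]))
}

# Toy data: two five-level columns, with the second column replaced
# by a constant code of 9 from row 801 onward.
set.seed(1)
m <- cbind(sample(1:5, 1000, replace = TRUE),
           sample(1:5, 1000, replace = TRUE))
m[801:1000, 2] <- 9

at <- seq(100, 1000, by = 100)
round(cbind(row = at,
            marginal = cum_marginal(m, at),
            joint    = cum_joint(m, at)), 2)
```

Evaluating on a grid rather than at every row keeps the cost manageable, since each point recomputes the frequency table from scratch. For 10,000 rows, a running table of counts updated one row at a time would be a more efficient design.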

The problem with joint entropy in this example is that the number of possible combined levels is extremely large when you combine five variables with five levels each and a sixth (age) with ten levels. The simulated data does have some induced correlation, which cuts down a bit on the number of joint categories that actually occur, but it is still quite large. The joint entropy only makes sense if the number of observations is several orders of magnitude greater than the number of possible joint categories.
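To put rough numbers on this, the short R calculation below counts the possible joint categories and compares the largest joint entropy those categories would allow with the largest value that 10,000 rows can produce. The joint entropy of n rows cannot exceed log2(n), a value reached only when every row is distinct. The variable names are illustrative.

```r
n_rows  <- 10000
n_cells <- 5^5 * 10          # five 5-level Likert items times ten ages

n_cells                      # 31250 possible joint categories
round(log2(n_cells), 2)      # 14.93 bits if every category were equally likely
round(log2(n_rows), 2)       # 13.29 bits, the ceiling for 10,000 rows
```

With fewer rows than possible categories, the cumulative joint entropy is driven largely by the row count itself, which is one reason its scale is hard to read.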

Also note that a change in either direction for cumulative entropy is of interest.

I plan to investigate this with further examples and extensions when I have time.

This work is licensed under a Creative Commons Attribution 3.0 United States License. This page was written by Steve Simon and was last modified on 2010-04-01.