P.Mean: Checks for data quality using metadata (created 2008-08-28).

I have been working on a series of web pages discussing automated checking for data quality. Automated checking is not a substitute for context-specific examination of data quality (e.g., prostate cancer in a female subject is probably an indication that something is unusual). For very large data sets, however, context-specific checking may be limited by time constraints. A set of simple checks that can be implemented quickly and automatically is a useful complement.

Data quality checks are also not a substitute for designing a robust, error-proof system of data entry up front.

I have been proposing cumulative entropy as a measure to track sudden shifts in data entry characteristics. Entropy can be thought of as the amount of underlying heterogeneity in a data set, and if this value suddenly shifts upward or downward, it may be an indication that a change in data entry practices has occurred. This may be due to real changes in the clinical trial (e.g., a change in the entry criteria that leads to a more expansive or more restrictive pool of subjects). Or it may indicate a problem with the data entry (creation or omission of new categories, data shifting into the wrong columns).
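To make the idea concrete, here is a minimal sketch of what a cumulative entropy calculation might look like (my own illustration, not part of the original page; the function names are hypothetical). It computes the Shannon entropy of the category frequencies seen so far at each point in the ordered column.

    import math
    from collections import Counter

    def shannon_entropy(counts):
        """Shannon entropy (in bits) of a table of category counts."""
        total = sum(counts.values())
        return -sum((n / total) * math.log2(n / total)
                    for n in counts.values() if n > 0)

    def cumulative_entropy(values):
        """Entropy of the first k values, for k = 1, ..., len(values).
        A sudden jump or drop in this series may flag a shift in
        data entry practices."""
        counts = Counter()
        series = []
        for v in values:
            counts[v] += 1
            series.append(shannon_entropy(counts))
        return series

For a column that starts out homogeneous, the series sits at zero and then jumps when a new category first appears.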

These quality checks can also be applied to metadata (data about the data itself). There are certain characteristics of a column of data that usually stay constant. A name, for example, is usually all letters, while an address is usually a mix of letters and numbers. A sudden shift in the composition of the data may indicate a problem.

Here is a series of binary metadata functions that could be applied quickly to a column of data:

1. Does the value contain any upper case letters?
2. Does the value contain any lower case letters?
3. Does the value contain any numbers?
4. Does the value contain any symbols?

The joint or marginal entropy of these metadata functions could then be examined.
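A rough sketch of these four checks in code (again my own illustration; the function name and the decision to ignore spaces are assumptions) might look like this:

    def metadata_pattern(value):
        """Code a value as four binary flags: contains upper case
        letters, lower case letters, numbers, symbols. Spaces are
        ignored so that ordinary multi-word values do not trip the
        symbol flag."""
        s = str(value).replace(" ", "")
        flags = [
            any(c.isupper() for c in s),      # upper case letters?
            any(c.islower() for c in s),      # lower case letters?
            any(c.isdigit() for c in s),      # numbers?
            any(not c.isalnum() for c in s),  # symbols?
        ]
        return "".join("1" if f else "0" for f in flags)

For example, metadata_pattern("123 Main Street") returns "1110", while metadata_pattern("Apt. #3G") returns "1111".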

The list of symbols could be tweaked a bit, but I would consider pulling out as a separate category the subset of symbols that are normally associated with numeric data (such as - and .).

If you see a sudden shift in entropy, that might indicate that a data coder has turned on the caps lock key, or has substituted letter codes for number codes in a column of data. In many situations, the metadata is constant. An address, for example, will almost always include upper and lower case letters and numbers, but no symbols (let's assume that things like "Apt. #3G" are stored in a separate column). So the metadata would almost always code as 1110 on the four questions above. An address that coded as 1100 would be unusual because it would not contain a number.

So if the entropy for a column of data suddenly shifted from zero to a small positive value, that would be an indication that the data no longer fit the normal pattern you have come to expect for that column.
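Combining the two sketches above (still hypothetical code), you can watch the cumulative entropy of the metadata patterns sit at zero and then move when an anomalous entry arrives:

    addresses = ["123 Main Street", "45 Oak Avenue", "9 Elm Road",
                 "Maple Court"]  # the last entry is missing its number
    patterns = [metadata_pattern(a) for a in addresses]
    print(cumulative_entropy(patterns))
    # [0.0, 0.0, 0.0, 0.81...] -- flat at zero while every pattern
    # is "1110", then positive when the "1100" entry arrives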

There are some discrete metadata functions that could also be applied, such as the length of a string or the number of decimal places included. If you had a column of data that included a series of one-letter responses, Y or N, and all of a sudden some two- and three-letter responses appear (YES and NO), that indicates a sudden shift in data entry patterns. If you have a series of birth weights in grams and suddenly the birth weights are entered in kilograms instead, you will see a surge in the number of data points with three decimal places rather than zero.
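These discrete functions are just as easy to sketch (hypothetical code; values are treated as strings, as typed):

    def string_length(value):
        """Length of the entry as typed."""
        return len(str(value))

    def decimal_places(value):
        """Number of digits after the decimal point (0 if none)."""
        s = str(value)
        return len(s.split(".")[1]) if "." in s else 0

A column of Y/N responses has string_length constantly equal to 1, so the entropy of the lengths is zero until a YES or NO shows up; a column of gram birth weights has decimal_places constantly equal to 0 until kilogram entries like "3.250" arrive with three.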

As noted in an earlier post, cumulative entropy will not make sense if data errors are sprinkled randomly throughout the data set; it makes sense only when the data is ordered by a variable that is unrelated to the important research variables in the data set.

The cumulative entropy can also lead to false positives, because sometimes a sudden surge or decline can be accounted for by specific changes in the data itself unrelated to data quality issues. For example, if a multi-center trial opens its first international center halfway through the trial, that would almost certainly lead to a false alarm.

Still, an automated check has value because it will focus your attention on "interesting" regions of the data set, which are likely to have a higher probability of changes in the data entry system than randomly selected regions.

I hope to explore some of these ideas on real data sets (perhaps some of the large-scale surveys available through the CDC) and to explore some of the theoretical properties of cumulative entropy (such as how to provide measures of normal variation in cumulative entropy) as I have time.

This work is licensed under a Creative Commons Attribution 3.0 United States License. This page was written by Steve Simon and was last modified on 2010-04-01.