P.Mean >> Category >> Information theory (created 2007-08-09).

These pages describe information theory, a branch of mathematics developed by Claude Shannon in the 1940s to model signals going through telephone lines. Information theory has found a diverse range of applications in areas like file compression and genetics. Articles are arranged by date with the most recent entries at the top. You can find outside resources at the bottom of this page.


20. P.Mean: Using information theory to identify discrepancies within and between text files (created 2010-09-02). I have been experimenting with the use of information theory to identify patterns in text data files. This work is somewhat preliminary, but it has some exciting possibilities. If there are certain patterns that occur frequently at a given column of a text data file (e.g., always the letters "A" or "B"), then these columns become important for looking for aberrant data that might be caused by a typographical error, a misalignment of the row of data, or a deviation from the code book. I want to show some preliminary graphs that illustrate what these patterns look like for some files I am working with. Warning: this is a very large webpage with graphics that extend across dozens of pages!

19. P.Mean: Using entropy and the surprisal value to measure the degree of agreement with the consensus finding (created 2010-03-02). One of the research problems that I am working on involves evaluation of a subjective rating system. I have been using information theory to try to identify objects where the evaluators agree well and objects where the evaluators do not agree well. I am also working on identifying objects that an individual rater does poorly on. The method is to measure when the surprisal of the category that a rater selected is much higher than the entropy (the average surprisal across all raters).

18. P.Mean: Ordinal surprisals (created 2010-03-20). Closely related to the concept of ordinal entropy is the concept of ordinal surprisals. The surprisal is the negative log base 2 of the probability, and if you multiply the probabilities by the surprisals and add them up, you get entropy. Can you define an ordinal surprisal in such a way that when you multiply the ordinal surprisals by the probabilities and add them up, you get the ordinal entropy?
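
The ordinary (non-ordinal) relationship between surprisal and entropy described above can be sketched in a few lines of Python. The function names and the example probabilities are invented for illustration and do not come from the article; this sketch does not attempt the ordinal version.

```python
import math

def surprisal(p):
    """Surprisal in bits: the negative base 2 logarithm of a probability."""
    return -math.log2(p)

def entropy(probs):
    """Entropy in bits: the probability-weighted sum of the surprisals.

    Categories with zero probability are skipped, since they contribute nothing.
    """
    return sum(p * surprisal(p) for p in probs if p > 0)

# Hypothetical probabilities for three categories
probs = [0.5, 0.25, 0.25]
surprisals = [surprisal(p) for p in probs]  # 1, 2, and 2 bits
h = entropy(probs)                          # 0.5*1 + 0.25*2 + 0.25*2 = 1.5 bits
print(surprisals, h)
```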

17. P.Mean: Ordinal entropy (created 2010-03-11). I have been using the concept of entropy to evaluate a sperm morphology classification system and to identify aberrant records in large fixed format text files. Some of the data I have been using in these areas is ordinal with three levels: normal, borderline, and abnormal. In all of my work so far, I have treated all three categories symmetrically. So, for example, the entropy of a system where 50% of the probability is associated with normal and 50% is associated with borderline is 1. The entropy of a system where 50% of the probability is associated with normal and 50% is associated with abnormal is also 1. It has always bothered me a bit because it seems that the second case, where the probabilities are placed at the two extremes, should have a higher level of entropy. Here is a brief outline of how I think entropy ought to be redefined to take into account the ordinal nature of a variable.
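
The symmetry that motivates this entry is easy to check numerically. The sketch below uses the standard Shannon entropy only (the variable names are illustrative, and no ordinal redefinition is attempted): both 50/50 splits give exactly 1 bit, regardless of which two categories carry the probability.

```python
import math

def entropy(probs):
    """Shannon entropy in bits; zero-probability categories are skipped."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Probabilities listed in the order (normal, borderline, abnormal)
adjacent = [0.5, 0.5, 0.0]   # normal and borderline
extremes = [0.5, 0.0, 0.5]   # normal and abnormal

h_adjacent = entropy(adjacent)
h_extremes = entropy(extremes)
print(h_adjacent, h_extremes)  # both are 1.0: the ordering of categories is ignored
```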


16. P.Mean: DNA binding image (created 2009-03-25). There is an important application of information theory in DNA binding that I discussed at my old website. I may want to expand that discussion into an article for Chance Magazine. If I do, here is an open source image of DNA binding that might be useful.

15. P.Mean: The surprisal matrix and applications to exploration of very large discrete data sets (created 2009-03-04). The surprisal, defined as the negative of the base 2 logarithm of a probability, is a fundamental component used in the calculation of entropy. In this talk, I will define a surprisal matrix for a data set consisting of multiple discrete variables, possibly with different supports. The surprisal matrix is useful in identifying areas of high heterogeneity in such a data set, which often corresponds to interesting and unusual patterns among the observations or among the variables. I will illustrate two applications of the surprisal matrix: monitoring data quality in a large stream of fixed format text data, and examining consensus in the evaluation of sperm morphology.
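
One possible reading of a surprisal matrix can be sketched as follows. This is an assumption, not the definition given in the talk: each cell's probability is estimated here by the relative frequency of that cell's value within its own column, and the data are made up.

```python
import math
from collections import Counter

def surprisal_matrix(rows):
    """Return one surprisal (in bits) per cell of a discrete data set.

    Assumption (not taken from the talk abstract): the probability of a cell
    is the relative frequency of its value within its own column.
    """
    n = len(rows)
    counts = [Counter(column) for column in zip(*rows)]
    return [
        [-math.log2(counts[j][value] / n) for j, value in enumerate(row)]
        for row in rows
    ]

# Hypothetical data: four records, two discrete variables with different supports
data = [("A", 1), ("A", 2), ("A", 1), ("B", 1)]
matrix = surprisal_matrix(data)
for record, surprisals in zip(data, matrix):
    print(record, [round(s, 3) for s in surprisals])
```

Under this assumption, rare values stand out as large entries, and the average of each column of the matrix equals the entropy of that variable.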

14. P.Mean: Calibrating information using a two by two table (created 2009-01-28). In a previous webpage, I discussed the concepts of joint entropy, conditional entropy, and information. The information for two measurements is zero if the two measurements are statistically independent. Information increases between two measurements as the degree of dependence (either positive or negative) increases. I thought it would be helpful to visualize this relationship graphically.
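
A minimal sketch of this behavior, assuming that "information" above refers to the mutual information between the two measurements. The function name and the four tables of counts are hypothetical.

```python
import math

def mutual_information(table):
    """Mutual information in bits for a two-way table of counts (list of rows)."""
    total = sum(sum(row) for row in table)
    row_p = [sum(row) / total for row in table]
    col_p = [sum(col) / total for col in zip(*table)]
    info = 0.0
    for i, row in enumerate(table):
        for j, count in enumerate(row):
            if count > 0:  # empty cells contribute nothing
                p = count / total
                info += p * math.log2(p / (row_p[i] * col_p[j]))
    return info

# Hypothetical two by two tables with increasing degrees of dependence
independent = [[25, 25], [25, 25]]
positive = [[40, 10], [10, 40]]
negative = [[10, 40], [40, 10]]
perfect = [[50, 0], [0, 50]]

for name, table in [("independent", independent), ("positive", positive),
                    ("negative", negative), ("perfect", perfect)]:
    print(name, round(mutual_information(table), 4))
```

The independent table gives zero, the positive and negative tables give the same intermediate value, and the perfectly dependent table gives 1 bit.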

13. P.Mean: A simple example of joint and conditional entropy (created 2009-01-07). In a project involving sperm morphology classification, I have found the concept of entropy very useful in analyzing the data and describing certain patterns. I want to extend the work to include joint and conditional entropy. I wanted to start with a simple data set, so I downloaded a file from the Data and Story Library website. There is an interesting file "High Fiber Diet Plan" that provides a useful way to explore joint and conditional entropy.
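
For readers unfamiliar with these quantities, here is a small sketch using the standard definitions. The counts are invented and are not the "High Fiber Diet Plan" data. It also checks the chain rule, H(X, Y) = H(X) + H(Y | X).

```python
import math

def entropy(probs):
    """Shannon entropy in bits; zero-probability categories are skipped."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical two-way table of counts (rows are X, columns are Y)
counts = [[30, 10], [20, 40]]
total = sum(sum(row) for row in counts)

joint = [c / total for row in counts for c in row]
marginal_x = [sum(row) / total for row in counts]

h_joint = entropy(joint)    # H(X, Y)
h_x = entropy(marginal_x)   # H(X)

# H(Y | X): entropy of Y within each row, weighted by that row's probability
h_y_given_x = sum(
    (sum(row) / total) * entropy([c / sum(row) for c in row])
    for row in counts
)

print(h_joint, h_x + h_y_given_x)  # the two values agree (chain rule)
```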


12. P.Mean: Categorizing entropy values (created 2008-09-17). I'm working on a project involving entropy where low values of entropy mean high levels of agreement (almost everybody classified a sperm cell as normal or almost everybody classified a sperm cell as abnormal). You might want to develop categories representing levels of agreement. I worked out a system, and breaking entropy into levels based on multiples of 0.3 seemed to work well. Is there a rational basis for this multiple?

11. P.Mean: Use of entropy measures for sperm morphology classification (created 2008-09-13). Entropy is a measure used in quantum physics, communications, file compression, and statistics. There are a variety of informal interpretations for entropy. A high value of entropy implies a great deal of uncertainty, very little regularity, and limited predictability. High entropy describes a process that is full of surprises. A low value of entropy implies limited uncertainty, substantial regularity, and very good predictability. The lowest value for entropy is zero, which represents constancy, perfect regularity, and perfect predictability. Entropy is a useful measure for sperm morphology classifications, because it provides a quantitative way to assess the degree to which different laboratory technicians will apply sperm morphology classifications differently on the same set of sperm cell images.

10. P.Mean: Jackknife applied to entropy calculations (created 2008-09-15). I have been working with entropy for a couple of different projects, and one important question to ask is "How much does the entropy change when a single observation is removed from the data set?" This process of removing one item from a data set and recalculating a statistic based on the remaining (n-1) observations is called jackknifing. It is a very simple but still very useful technique in a variety of statistical settings.
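
The leave-one-out recalculation described above can be sketched like this. The function names and the ratings are hypothetical, and entropy is estimated from the observed relative frequencies.

```python
import math
from collections import Counter

def entropy_of_sample(values):
    """Entropy in bits, estimated from the relative frequencies in a sample."""
    n = len(values)
    return sum((c / n) * -math.log2(c / n) for c in Counter(values).values())

def jackknife_entropies(values):
    """Entropy recomputed n times, leaving out one observation each time."""
    return [
        entropy_of_sample(values[:i] + values[i + 1:])
        for i in range(len(values))
    ]

# Hypothetical ratings of a single object by eight raters
ratings = ["normal"] * 7 + ["abnormal"]
full = entropy_of_sample(ratings)
changes = [h - full for h in jackknife_entropies(ratings)]

for rating, change in zip(ratings, changes):
    print(rating, round(change, 4))
```

In this made-up example, removing the lone "abnormal" rating drops the entropy to zero, while removing any one of the "normal" ratings raises it slightly.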

9. P.Mean: Applying the sequence logo concept to data quality (created 2008-09-04). I am trying to adapt the logo graph used in genetics to an examination of data quality. I am just starting this, so the graphs are a bit crude. I took the 1973 NAMCS data set and calculated entropy for each column of data. This is a massive data set with 29,210 rows and 85 columns.
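
A rough sketch of a per-column entropy calculation for fixed format text follows. It assumes each character position is treated as its own discrete variable; the records are invented and are not the NAMCS file.

```python
import math
from collections import Counter

def column_entropies(lines):
    """Entropy in bits of the characters found at each column position.

    Assumes fixed format records; shorter lines are padded with spaces.
    """
    width = max(len(line) for line in lines)
    padded = [line.ljust(width) for line in lines]
    n = len(padded)
    result = []
    for column in zip(*padded):
        counts = Counter(column)
        result.append(sum((c / n) * -math.log2(c / n) for c in counts.values()))
    return result

# Hypothetical fixed format records, three columns wide
records = ["A1X", "A2X", "B1X", "B2Y"]
entropies = column_entropies(records)
print([round(h, 3) for h in entropies])
```

Columns that are nearly constant have entropy close to zero, so a single unexpected character in such a column is a natural candidate for a closer look.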

8. P.Mean: Checks for data quality using metadata (created 2008-08-28). I have been working on a series of webpages discussing automated checking for data quality. I have been proposing cumulative entropy as a measure to track sudden shifts in data entry characteristics. Entropy can be thought of as the amount of underlying heterogeneity in a data set, and if this value suddenly shifts upward or downward, it may be an indication that a change has occurred in how the data are being entered. These quality checks can also be applied to metadata (data about the data itself). There are certain characteristics of a column of data that usually stay constant. A name, for example, is usually all letters, while an address is usually a mix of letters and numbers. A sudden shift in the composition of the data may indicate a problem.

7. P.Mean: Entropy as a measure of data quality across multiple variables (created 2008-08-25). In a previous webpage, I discussed the use of cumulative entropy as a measure of data quality. A sudden shift in cumulative entropy that is not associated with a significant change in the research design is a possible marker for a data quality issue. The advantage of entropy is that it can be used for very large data sets where a context specific analysis of data quality is difficult or impractical. Entropy can also be cumulated across multiple columns of data, to look for global shifts.

6. P.Mean: Cumulative entropy as a measure of data quality (created 2008-08-11). I was talking to someone about some of my work with control charts, and they asked a question out of the blue. A lot of data sources that might be candidates for my control chart software have potential problems with data quality. Did I have any thoughts about ways to screen for poor data quality?

Outside resources

Background frequencies for residue variability estimates: BLOSUM revisited. I. Mihalek, I. Res and O. Lichtarge. BMC Bioinformatics 2007, 8:488. doi:10.1186/1471-2105-8-488. Description: This paper adjusts the classic measure of entropy developed by Claude Shannon to account for different mutation probabilities.

Information Is Not Entropy, Information Is Not Uncertainty! Thomas D. Schneider, National Cancer Institute. Excerpt: There are many many statements in the literature which say that information is the same as entropy. The reason for this was told by Tribus. The story goes that Shannon didn't know what to call his measure so he asked von Neumann, who said `You should call it entropy ... [since] ... no one knows what entropy really is, so in a debate you will always have the advantage' (Tribus1971). This website was last verified on 2008-01-14. URL: www.lecb.ncifcrf.gov/~toms/information.is.not.uncertainty.html

Linear Information Models: An Introduction. Philip E. Cheng, Jiun W. Liou, Michelle Liou and John A. D. Aston. Journal of Data Science, v.5, no.3, 297-313. Description: The classic analysis of variance model involves partitioning variances into several discrete components. You can use a similar approach for categorical data by partitioning measures of entropy and information. This article introduces how this is done for a few simple examples.

Sesame Street: Cookie Monster's Sorting Song. 2008. Note: a common theme for my work in this area is captured by this Cookie Monster song: "One of these things is not like the other things." Here's the Sesame Street description: "In this video, Cookie Monster plays a game with cookies. Sesame Street is a production of Sesame Workshop, a nonprofit educational organization which also produces Pinky Dinky Doo, The Electric Company, and other programs for children around the world. For more videos and games check out our new website at http://www.sesamestreet.org." [Accessed August 31, 2010]. Available at: http://www.youtube.com/watch?v=0WhuikFY1Pg&feature=youtube_gdata_player.

Deffeyes J, Harbourne R, Dejong S, et al. Use of information entropy measures of sitting postural sway to quantify developmental delay in infants. Journal of NeuroEngineering and Rehabilitation. 2009;6(1):34. Available at: http://www.jneuroengrehab.com/content/6/1/34 [Accessed August 18, 2009].

All of the material above this paragraph is licensed under a Creative Commons Attribution 3.0 United States License. This page was written by Steve Simon and was last modified on 2010-09-02. The material below this paragraph links to my old website, StATS. Although I wrote all of the material listed below, my ex-employer, Children's Mercy Hospital, has claimed copyright ownership of this material. The brief excerpts shown here are included under the fair use provisions of U.S. copyright law.


7. Stats: The HapMap project (December 12, 2005). One of the research projects I am involved with may make use of the HapMap project (www.hapmap.org). This project is an ambitious effort to document the frequency of most Single Nucleotide Polymorphisms (SNPs) in the Human Genome.

6. Stats: Information content of a continuous distribution (August 1, 2005). I was browsing through the book Statistical Distributions, Second Edition (Evans M, Hastings N, Peacock B. New York: John Wiley & Sons; 1993. ISBN: 0471559512) when I noticed that they defined the information content of the exponential distribution. Very interesting, I thought, since I had been working on information theory models for categorical variables and had wondered how you might extend this to continuous variables.

5. Stats: Information theory and microarrays (June 1, 2005). I have been toying with the idea of using information theory in a microarray experiment to allow incorporation of the proportion of "absent" genes in the calculation of variance reduction. An "absent" gene in an Affymetrix array is a gene where the mismatch probes light up as brightly as or more brightly than the perfect match probes. As I understand it, the signal associated with these genes is probably not related to that gene itself but to some cross-hybridizing genes.

4. Stats: More on information theory models (March 24, 2005). Some of the people I work with have used information theory extensively in their work. A good summary of their efforts appears in Automated splicing mutation analysis by information theory. Nalla VK, Rogan PK. Hum Mutat 2005: 25(4); 334-342, and Information analysis of human splice site mutations. Rogan PK, Faux BM, Schneider TD. Human Mutation 1998: 12(3); 153-71, and they have a website with software, https://splice.cmh.edu.


3. Stats: More on information theory models (August 31, 2004). I received a few suggestions for interesting web sites today. They deal with information theory and trying to establish causation.

2. Stats: Information theory models (May 26, 2004). This page provides some simple applications where information theory provides a useful analysis.

1. Stats: Information theory models (May 11, 2004). The past few weeks, I've been working on a web page that looks at concepts like entropy, uncertainty, and information theory. It started out as a simple definition of entropy, but grew so much that I split most of the material off into a separate handout on Information Theory Models. In the process of looking for web resources, I found a fascinating book, Information Theory, Inference, and Learning Algorithms by David J.C. MacKay (ISBN: 0521642981).

