StATS: The HapMap project (December 12, 2005)

One of the research projects I am involved with may make use of the HapMap project (www.hapmap.org). This project is an ambitious effort to document the frequency of most Single Nucleotide Polymorphisms (SNPs) in the human genome. A SNP is a location on the genome where individuals differ in a single nucleotide. The HapMap project looks at four populations: CEU (Utah residents with ancestry from northern and western Europe), CHB (Han Chinese in Beijing, China), JPT (Japanese in Tokyo, Japan), and YRI (Yoruba in Ibadan, Nigeria).

There is data on all the chromosomes, but I started some analyses using the Y chromosome since it has a very small number of SNPs. In the CEU file, there are 69, in CHB there are 68, in JPT there are 68, and in YRI there are 64. Across all four populations, there are 51 SNPs in common. Some of these SNPs are uninteresting, because 100% of the subjects tested so far had the same nucleotide. Eliminate these SNPs, and you have 15 remaining. Here is the raw data on these 15 SNPs.

rs2534636 2302306  ceu C/C C/T T/T  27   0   1  28
rs2534636 2302306  chb C/C C/T T/T  20   0   0  20
rs2534636 2302306  jpt C/C C/T T/T  20   0   0  20
rs2534636 2302306  yri C/C C/T T/T  30   0   0  30
                                    97   0   1  98

rs9786608 2309073  ceu C/C C/T T/T  27   0   0  27
rs9786608 2309073  chb C/C C/T T/T  21   0   0  21
rs9786608 2309073  jpt C/C C/T T/T  19   0   0  19
rs9786608 2309073  yri C/C C/T T/T   0   0  28  28
                                    67   0  28  95

rs3899    6994490  ceu G/G G/T T/T  30   0   0  30
rs3899    6994490  chb G/G G/T T/T  22   0   0  22
rs3899    6994490  jpt G/G G/T T/T  20   0   2  22
rs3899    6994490  yri G/G G/T T/T  30   0   0  30
                                   102   0   2 104

rs9786896 8322494  ceu A/A A/G G/G   0  21   9  30
rs9786896 8322494  chb A/A A/G G/G   0   0  17  17
rs9786896 8322494  jpt A/A A/G G/G   0   0  19  19
rs9786896 8322494  yri A/A A/G G/G   0   0  30  30
                                     0  21  75  96

rs9785941 13105848 ceu A/A A/G G/G  30   0   0  30
rs9785941 13105848 chb A/A A/G G/G  18   0   0  18
rs9785941 13105848 jpt A/A A/G G/G  22   0   0  22
rs9785941 13105848 yri A/A A/G G/G   2   0  28  30
                                    72   0  28 100

rs2032597 13856457 ceu A/A A/C C/C  21   0   8  29
rs2032597 13856457 chb A/A A/C C/C  21   0   0  21
rs2032597 13856457 jpt A/A A/C C/C  22   0   0  22
rs2032597 13856457 yri A/A A/C C/C  29   0   0  29
                                    93   0   8 101

rs2032605 13933534 ceu C/C C/T T/T  28   0   0  28
rs2032605 13933534 chb C/C C/T T/T  20   0   1  21
rs2032605 13933534 jpt C/C C/T T/T  21   0   0  21
rs2032605 13933534 yri C/C C/T T/T  30   0   0  30
                                    99   0   1 100

rs2032590 14028278 ceu T/T G/T G/G  29   0   0  29
rs2032590 14028278 chb T/T G/T G/G  20   0   0  20
rs2032590 14028278 jpt T/T G/T G/G  22   0   0  22
rs2032590 14028278 yri T/T G/T G/G  17   0   9  26
                                    88   0   9  97

rs2032624 14035089 ceu A/A A/C C/C   9   0  21  30
rs2032624 14035089 chb A/A A/C C/C  22   0   0  22
rs2032624 14035089 jpt A/A A/C C/C  22   0   0  22
rs2032624 14035089 yri A/A A/C C/C  30   0   0  30
                                    83   0  21 104

rs2032658 14590648 ceu G/G A/G A/A  21   0   9  30
rs2032658 14590648 chb G/G A/G A/A   0   0  18  18
rs2032658 14590648 jpt G/G A/G A/A   0   0  20  20
rs2032658 14590648 yri G/G A/G A/A   0   0  30  30
                                    21   0  77  98

rs3848982 20612392 ceu G/G A/G A/A  29   0   0  29
rs3848982 20612392 chb G/G A/G A/A  21   0   0  21
rs3848982 20612392 jpt G/G A/G A/A  12   0   9  21
rs3848982 20612392 yri G/G A/G A/A   0   0  30  30
                                    62   0  39 101

rs9306841 20674182 ceu C/C C/G G/G  29   0   0  29
rs9306841 20674182 chb C/C C/G G/G  20   0   0  20
rs9306841 20674182 jpt C/C C/G G/G  18   0   0  18
rs9306841 20674182 yri C/C C/G G/G   0   0  30  30
                                    67   0  30  97

rs2032612 20761675 ceu C/C C/T T/T  30   0   0  30
rs2032612 20761675 chb C/C C/T T/T  20   0   0  20
rs2032612 20761675 jpt C/C C/T T/T  16   0   2  18
rs2032612 20761675 yri C/C C/T T/T  30   0   0  30
                                    96   0   2  98

rs2032635 20784951 ceu A/A A/G G/G  19   0   9  28
rs2032635 20784951 chb A/A A/G G/G   0   0  22  22
rs2032635 20784951 jpt A/A A/G G/G   0   0  22  22
rs2032635 20784951 yri A/A A/G G/G   0   0  30  30
                                    19   0  83 102

rs2032652 20812497 ceu T/T C/T C/C  30   0   0  30
rs2032652 20812497 chb T/T C/T C/C  20   0   2  22
rs2032652 20812497 jpt T/T C/T C/C  10   0  12  22
rs2032652 20812497 yri T/T C/T C/C   0   0  30  30
                                    60   0  44 104

The first entry is the identification number for the SNP, and the second entry gives the physical location of the SNP in base pairs. Notice that even though the Y chromosome is very small, it still spans tens of millions of base pairs. The third entry identifies the group. The fourth, fifth, and sixth entries list the three possible genotypes for the SNP, the seventh, eighth, and ninth entries give the number of subjects with each genotype, and the last entry is the row total. The unlabeled line below each block gives the totals across the four groups. The HapMap project uses two letters for consistency with the other chromosomes, but the Y chromosome (like the X chromosome in males) occurs by itself, so it is impossible to have a heterozygous SNP. The fourth SNP (rs9786896) has an obvious error, since it includes counts for a heterozygous genotype in the CEU group.
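The column layout described above can be read with a few lines of code. The following is a minimal sketch in Python, assuming whitespace-separated fields in the order shown in the listing. The names `SnpRow`, `parse_row`, and `has_heterozygous_count` are illustrative and are not part of any HapMap tool. The sketch also flags rows that report a nonzero count for the middle (heterozygous) genotype, which should not happen on the Y chromosome.

```python
from typing import NamedTuple


class SnpRow(NamedTuple):
    rs_id: str
    position: int      # physical location in base pairs
    group: str         # ceu, chb, jpt, or yri
    genotypes: tuple   # three genotype labels, e.g. ("C/C", "C/T", "T/T")
    counts: tuple      # subject counts, in the same order as genotypes
    total: int


def parse_row(line: str) -> SnpRow:
    """Parse one per-group row of the raw data listing."""
    fields = line.split()
    if len(fields) != 10:
        raise ValueError(f"expected 10 fields, got {len(fields)}")
    counts = tuple(int(x) for x in fields[6:9])
    total = int(fields[9])
    if sum(counts) != total:
        raise ValueError("genotype counts do not add up to the row total")
    return SnpRow(fields[0], int(fields[1]), fields[2],
                  tuple(fields[3:6]), counts, total)


def has_heterozygous_count(row: SnpRow) -> bool:
    """True when the middle (heterozygous) genotype has a nonzero count."""
    return row.counts[1] > 0


row = parse_row("rs9786896 8322494  ceu A/A A/G G/G   0  21   9  30")
print(row.rs_id, row.counts, has_heterozygous_count(row))
# prints: rs9786896 (0, 21, 9) True
```

Running the same check over every row of the listing would single out the CEU row of rs9786896 as the only one with a heterozygous count.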

An interesting question is whether the base found at a particular SNP provides any useful clues about whether a person might belong to CEU, CHB, JPT, or YRI. For a particular SNP to be helpful, it must show a range of variation. The first SNP (rs2534636) is C for 97 out of 98 people, and is unlikely to produce any valuable clues. In contrast, the second SNP (rs9786608) is C for 67 cases and T for 28 cases.

You can use a measure of uncertainty (sometimes called the entropy measure) to quantify the variation in SNP probabilities. The formula for uncertainty is

$$H(Y) = -\sum_i p_i \log_2 p_i$$

where Y is the base observed at the SNP and the p_i represent the probabilities of the two bases (or the probabilities of the three possible allele combinations for paired genes). The logarithm is taken to base 2, and a term with p_i = 0 contributes zero to the sum. The uncertainty measures for the 15 SNPs are:

rs#       position H(Y)
rs2534636  2302306 0.08
rs9786608  2309073 0.87
rs3899     6994490 0.14
rs9786896  8322494 0.76
rs9785941 13105848 0.86
rs2032597 13856457 0.40
rs2032605 13933534 0.08
rs2032590 14028278 0.45
rs2032624 14035089 0.73
rs2032658 14590648 0.75
rs3848982 20612392 0.96
rs9306841 20674182 0.89
rs2032612 20761675 0.14
rs2032635 20784951 0.69
rs2032652 20812497 0.98

Notice that the uncertainty is largest for the SNPs whose splits are closest to 50-50. The last SNP (rs2032652), for example, has 60 Ts and 44 Cs and has an uncertainty of 0.98, close to the maximum of 1 that a two-way split can reach.
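The uncertainty values in the table can be reproduced from the pooled counts in the raw data. This is a minimal sketch in Python, assuming base-2 logarithms (which match the tabled values) and treating zero counts as contributing nothing; the function name `uncertainty` is my own label.

```python
import math


def uncertainty(counts):
    """Shannon entropy, in bits, of the distribution implied by the counts."""
    total = sum(counts)
    h = 0.0
    for c in counts:
        if c > 0:                  # a zero count contributes nothing
            p = c / total
            h -= p * math.log2(p)
    return h


# Pooled genotype counts across all four groups, taken from the raw data.
print(round(uncertainty([60, 0, 44]), 2))  # rs2032652, prints 0.98
print(round(uncertainty([97, 0, 1]), 2))   # rs2534636, prints 0.08
```

A perfectly even split, such as `uncertainty([50, 50])`, gives 1 bit, and a SNP where everyone has the same base gives 0.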

A range of variation, or equivalently a large value of uncertainty, is helpful for distinguishing among the groups, but it is also important to show that this variation or uncertainty depends on the groups themselves. The concept of conditional uncertainty is helpful for quantifying this. The conditional uncertainty is computed in a fashion similar to the uncertainty measure described above, but you substitute conditional probabilities in the formula:

$$H(Y \mid x) = -\sum_i p(y_i \mid x) \log_2 p(y_i \mid x)$$

where x is one of the four groups and p(y_i | x) is the probability of base y_i among the subjects in group x.

You can compute conditional uncertainty for each group. The results are presented below:

rs#       position group H(Y|x)
rs2534636  2302306   ceu 0.22
rs2534636  2302306   chb 0.00
rs2534636  2302306   jpt 0.00
rs2534636  2302306   yri 0.00

rs9786608  2309073   ceu 0.00
rs9786608  2309073   chb 0.00
rs9786608  2309073   jpt 0.00
rs9786608  2309073   yri 0.00

rs3899     6994490   ceu 0.00
rs3899     6994490   chb 0.00
rs3899     6994490   jpt 0.44
rs3899     6994490   yri 0.00

rs9786896  8322494   ceu 0.88
rs9786896  8322494   chb 0.00
rs9786896  8322494   jpt 0.00
rs9786896  8322494   yri 0.00

rs9785941 13105848   ceu 0.00
rs9785941 13105848   chb 0.00
rs9785941 13105848   jpt 0.00
rs9785941 13105848   yri 0.35

rs2032597 13856457   ceu 0.85
rs2032597 13856457   chb 0.00
rs2032597 13856457   jpt 0.00
rs2032597 13856457   yri 0.00

rs2032605 13933534   ceu 0.00
rs2032605 13933534   chb 0.28
rs2032605 13933534   jpt 0.00
rs2032605 13933534   yri 0.00

rs2032590 14028278   ceu 0.00
rs2032590 14028278   chb 0.00
rs2032590 14028278   jpt 0.00
rs2032590 14028278   yri 0.93

rs2032624 14035089   ceu 0.88
rs2032624 14035089   chb 0.00
rs2032624 14035089   jpt 0.00
rs2032624 14035089   yri 0.00

rs2032658 14590648   ceu 0.88
rs2032658 14590648   chb 0.00
rs2032658 14590648   jpt 0.00
rs2032658 14590648   yri 0.00

rs3848982 20612392   ceu 0.00
rs3848982 20612392   chb 0.00
rs3848982 20612392   jpt 0.99
rs3848982 20612392   yri 0.00

rs9306841 20674182   ceu 0.00
rs9306841 20674182   chb 0.00
rs9306841 20674182   jpt 0.00
rs9306841 20674182   yri 0.00

rs2032612 20761675   ceu 0.00
rs2032612 20761675   chb 0.00
rs2032612 20761675   jpt 0.50
rs2032612 20761675   yri 0.00

rs2032635 20784951   ceu 0.91
rs2032635 20784951   chb 0.00
rs2032635 20784951   jpt 0.00
rs2032635 20784951   yri 0.00

rs2032652 20812497   ceu 0.00
rs2032652 20812497   chb 0.44
rs2032652 20812497   jpt 0.99
rs2032652 20812497   yri 0.00
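The per-group values above are just the uncertainty formula applied to each group's counts separately. Here is a minimal sketch in Python for the last SNP, rs2032652, using the counts from the raw data; the helper `uncertainty` and the dictionary layout are illustrative choices, not the method used to produce the table.

```python
import math


def uncertainty(counts):
    """Shannon entropy, in bits, of the distribution implied by the counts."""
    total = sum(counts)
    h = 0.0
    for c in counts:
        if c > 0:                  # a zero count contributes nothing
            p = c / total
            h -= p * math.log2(p)
    return h


# Genotype counts (T/T, C/T, C/C) for rs2032652 within each group.
rs2032652 = {
    "ceu": [30, 0, 0],
    "chb": [20, 0, 2],
    "jpt": [10, 0, 12],
    "yri": [0, 0, 30],
}

# Conditional uncertainty H(Y | x), one value per group x.
conditional = {group: uncertainty(c) for group, c in rs2032652.items()}
for group, h in conditional.items():
    print(f"{group} {h:.2f}")
# prints: ceu 0.00, chb 0.44, jpt 0.99, yri 0.00 (one per line)
```

Groups where every subject has the same base have zero conditional uncertainty, while the nearly even split in JPT gives a value close to 1.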

For some SNPs, the conditional uncertainty is larger than the (unconditional) uncertainty for one group and smaller for the other three groups. For other SNPs, the conditional uncertainties are all smaller than the unconditional uncertainty. You can aggregate the conditional uncertainties together using the formula:

$$H(Y \mid X) = \sum_x p(x) \, H(Y \mid x)$$

where p(x) is the proportion of subjects who belong to group x.

This is just a weighted average of the conditional uncertainties. The difference between the unconditional uncertainty and this aggregated conditional uncertainty is called the mutual information:

$$I(X;Y) = H(Y) - H(Y \mid X)$$
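To make the weighted average and the resulting difference concrete, here is a minimal sketch in Python for two SNPs from the raw data. It assumes the weights are the observed group sizes and uses base-2 logarithms; the function names are my own.

```python
import math


def uncertainty(counts):
    """Shannon entropy, in bits, of the distribution implied by the counts."""
    total = sum(counts)
    h = 0.0
    for c in counts:
        if c > 0:                  # a zero count contributes nothing
            p = c / total
            h -= p * math.log2(p)
    return h


def conditional_uncertainty(table):
    """Average of the per-group uncertainties, weighted by group size."""
    n = sum(sum(c) for c in table.values())
    return sum(sum(c) / n * uncertainty(c) for c in table.values())


def pooled_uncertainty(table):
    """Uncertainty of the genotype counts summed over all groups."""
    return uncertainty([sum(col) for col in zip(*table.values())])


snps = {
    "rs9786608": {"ceu": [27, 0, 0], "chb": [21, 0, 0],
                  "jpt": [19, 0, 0], "yri": [0, 0, 28]},
    "rs2534636": {"ceu": [27, 0, 1], "chb": [20, 0, 0],
                  "jpt": [20, 0, 0], "yri": [30, 0, 0]},
}

results = {}
for name, table in snps.items():
    h_y = pooled_uncertainty(table)
    h_y_given_x = conditional_uncertainty(table)
    results[name] = (h_y, h_y_given_x, h_y - h_y_given_x)
    print(f"{name} {h_y:.2f} {h_y_given_x:.2f} {h_y - h_y_given_x:.2f}")
# prints: rs9786608 0.87 0.00 0.87
#         rs2534636 0.08 0.06 0.02
```

For rs9786608, knowing the group removes all of the uncertainty about the base. For rs2534636 there was almost no uncertainty to remove in the first place.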

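This definition is quite intuitive. The mutual information represents the decrease in uncertainty in Y when you know the value of X. This definition of mutual information is symmetric, so if you reverse the roles of X and Y and define

$$I(Y;X) = H(X) - H(X \mid Y)$$

then you will get the exact same answer:

$$I(Y;X) = I(X;Y)$$

The uncertainty, aggregated conditional uncertainty, and mutual information for the 15 SNPs are: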

rs#       position H(Y) H(Y|X) I(X;Y)
rs2534636  2302306 0.08   0.06   0.02
rs9786608  2309073 0.87   0.00   0.87
rs3899     6994490 0.14   0.09   0.04
rs9786896  8322494 0.76   0.28   0.48
rs9785941 13105848 0.86   0.11   0.75
rs2032597 13856457 0.40   0.24   0.16
rs2032605 13933534 0.08   0.06   0.02
rs2032590 14028278 0.45   0.25   0.20
rs2032624 14035089 0.73   0.25   0.47
rs2032658 14590648 0.75   0.27   0.48
rs3848982 20612392 0.96   0.20   0.76
rs9306841 20674182 0.89   0.00   0.89
rs2032612 20761675 0.14   0.09   0.05
rs2032635 20784951 0.69   0.25   0.44
rs2032652 20812497 0.98   0.30   0.68
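The symmetry of mutual information can be checked numerically. The sketch below, in Python, computes the information for rs2032652 twice: once as the reduction in uncertainty about the base given the group, and once as the reduction in uncertainty about the group given the base, by transposing the table of counts. The function `information` is a hypothetical helper written for this illustration.

```python
import math


def uncertainty(counts):
    """Shannon entropy, in bits, of the distribution implied by the counts."""
    total = sum(counts)
    h = 0.0
    for c in counts:
        if c > 0:                  # a zero count contributes nothing
            p = c / total
            h -= p * math.log2(p)
    return h


def information(table):
    """H(columns) - H(columns | rows) for a table given as a list of rows."""
    n = sum(sum(row) for row in table)
    pooled = [sum(col) for col in zip(*table)]
    conditional = sum(sum(row) / n * uncertainty(row) for row in table)
    return uncertainty(pooled) - conditional


# Rows are groups (ceu, chb, jpt, yri); columns are genotypes (T/T, C/T, C/C).
by_group = [[30, 0, 0], [20, 0, 2], [10, 0, 12], [0, 0, 30]]
# Transposed: rows are genotypes, columns are groups.
by_genotype = [list(col) for col in zip(*by_group)]

i_xy = information(by_group)     # H(Y) - H(Y | X)
i_yx = information(by_genotype)  # H(X) - H(X | Y)
print(f"{i_xy:.2f} {i_yx:.2f}")  # prints: 0.68 0.68
```

Both directions agree with the 0.68 shown in the table, even though H(X) and H(Y) themselves are quite different.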

Things left to do: Add commentary about SNPs with high information.

There is a lot of discussion about linkage disequilibrium, and these measures of joint and conditional uncertainty may provide some useful ways to assess the degree of disequilibrium. Here are a few references about disequilibrium that relate to the HapMap project:

Additional reference (added January 3, 2006)

This page was written by Steve Simon while working at Children's Mercy Hospital. Although I do not hold the copyright for this material, I am reproducing it here as a service, as it is no longer available on the Children's Mercy Hospital website.