StATS: Haplotype analysis (January 13, 2006).

One of the people I work with wants to include a haplotype analysis in their research grant. I know nothing about haplotype analysis, so I am currently investigating various publications, web sites, and software. I want to include these resources here and eventually organize a web page that describes the statistical approach to haplotype analysis. I also think that there may be some benefit to using an information theory model in this type of analysis, but that is just some preliminary speculation on my part. I have looked a bit at this issue already while trying to understand the HapMap project.

Wikipedia has a brief explanation of haplotypes:

A haplotype, a contraction of the phrase "haploid genotype", is the genetic constitution of an individual chromosome. In the case of diploid organisms such as humans, the haplotype will contain one member of the pair of alleles for each site. A haplotype can refer to only one locus or to an entire genome. A genome-wide haplotype would comprise half of a diploid genome, including one allele from each allelic gene pair. In a second meaning it refers to a set of single nucleotide polymorphisms (SNPs) found to be statistically associated on a single chromatid. With this knowledge, the identification of a few alleles of a haplotype block unambiguously identifies all other polymorphic sites in this region. Such information is most valuable to investigate the genetics behind common diseases and is collected by the International HapMap Project.

and a publication by Schaid et al. (2002) outlines the basic problem:

For unrelated subjects, haplotypes can be directly observed whenever there is no more than one heterozygous site. If there are H heterozygous sites, then the number of pairs of haplotypes that are consistent with the observed marker phenotypes is 2^(H-1). Although the observations on codominant genetic markers are often referred to as “genotypes,” we shall refer to them as “marker phenotypes,” reserving the term “genotype” for when linkage phase is known. The traditional method to determine haplotypes is either pedigree analysis or molecular haplotyping (limited to short DNA sequences) (Michalatos-Beloin et al. 1996). Both of these methods require enormous work either to collect a sufficient number of pedigree members or to perform the necessary laboratory work. Although new genetic technology (e.g., conversion technology [Yan et al. 2000]) may improve molecular haplotyping, the current methods are not adequate for large-scale epidemiological studies of human traits. To account for ambiguous haplotypes among unrelated subjects, several algorithms—including a parsimony algorithm (Clark 1990), a Bayesian population genetic model that uses coalescent theory (Stephens et al. 2001b; Zhang et al. 2001), and maximum likelihood (Terwilliger and Ott 1994; Excoffier and Slatkin 1995; Hawley and Kidd 1995; Long et al. 1995)—have been proposed. An advantage of the likelihood approach is that, in addition to the estimated haplotype frequencies, the posterior probabilities of the pairs of haplotypes that are consistent with the observed marker phenotypes can be computed for each subject.
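To make the counting argument and the likelihood approach in that passage concrete for myself, here is a minimal Python sketch. It is my own illustration, not code from Schaid et al. or from any of the software listed on this page. It assumes biallelic markers coded as 0, 1, or 2 copies of one allele per site, unrelated subjects, and Hardy-Weinberg proportions. The function names consistent_pairs and em_haplotype_frequencies are hypothetical. The first function lists the haplotype pairs consistent with an unphased observation, giving 2^(H-1) pairs when there are H heterozygous sites (H at least 1). The second runs a bare-bones EM iteration of the general kind described in the maximum likelihood references, returning estimated haplotype frequencies and per-subject posterior probabilities of each pair.

```python
from collections import defaultdict
from itertools import product


def consistent_pairs(genotype):
    """List unordered haplotype pairs consistent with an unphased genotype.

    genotype: tuple with one entry per site, the number of copies (0, 1, 2)
    of allele "1" at that site. A value of 1 marks a heterozygous site.
    """
    het = [i for i, g in enumerate(genotype) if g == 1]
    # Homozygous sites are fixed on both haplotypes (0 -> 0, 2 -> 1).
    base = [g // 2 for g in genotype]
    if not het:
        h = tuple(base)
        return [(h, h)]
    pairs = []
    # Pin the first heterozygous site to allele 0 on the first haplotype so
    # that (h1, h2) and (h2, h1) are not counted twice.
    for rest in product((0, 1), repeat=len(het) - 1):
        h1, h2 = list(base), list(base)
        for site, allele in zip(het, (0,) + rest):
            h1[site] = allele
            h2[site] = 1 - allele
        pairs.append((tuple(h1), tuple(h2)))
    return pairs


def em_haplotype_frequencies(genotypes, n_iter=50):
    """Bare-bones EM for haplotype frequencies under Hardy-Weinberg.

    Returns (freq, posteriors): freq maps each haplotype to its estimated
    frequency, and posteriors[i][j] is the posterior probability of the
    j-th pair in consistent_pairs(genotypes[i]) from the final E-step.
    """
    pairs_per_subject = [consistent_pairs(g) for g in genotypes]
    haps = sorted({h for pairs in pairs_per_subject for pair in pairs for h in pair})
    freq = {h: 1.0 / len(haps) for h in haps}  # uniform starting values
    posteriors = []
    for _ in range(n_iter):
        counts = defaultdict(float)
        posteriors = []
        for pairs in pairs_per_subject:
            # E-step: P(pair) is f1 * f2, doubled when the haplotypes differ.
            weights = [freq[a] * freq[b] * (1 if a == b else 2) for a, b in pairs]
            total = sum(weights)
            post = [w / total for w in weights]
            posteriors.append(post)
            for (a, b), p in zip(pairs, post):
                counts[a] += p
                counts[b] += p
        # M-step: expected haplotype counts divided by number of chromosomes.
        n_chrom = 2 * len(genotypes)
        freq = {h: counts[h] / n_chrom for h in haps}
    return freq, posteriors


if __name__ == "__main__":
    # Four homozygous subjects and one double heterozygote whose phase is ambiguous.
    sample = [(0, 0), (0, 0), (2, 2), (2, 2), (1, 1)]
    print(consistent_pairs((1, 1)))
    freq, posteriors = em_haplotype_frequencies(sample)
    print(freq)
    print(posteriors[-1])
```

In the small example at the bottom, the double heterozygote could carry either of two haplotype pairs, and the EM step leans on the haplotypes seen unambiguously in the other subjects to weight them. A real analysis would need convergence checks, multiple starting values, and handling of missing data, which is what the packages listed below provide.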


A new statistical method for haplotype reconstruction from population data. Stephens M, Smith NJ, Donnelly P. Am J Hum Genet 2001: 68(4); 978-89.

Evidence for substantial fine-scale variation in recombination rates across the human genome. Crawford DC, Bhangale T, Li N, Hellenthal G, Rieder MJ, Nickerson DA, Stephens M. Nat Genet 2004: 36(7); 700-6.

Score tests for association between traits and haplotypes when linkage phase is ambiguous. Schaid DJ, Rowland CM, Tines DE, Jacobson RM, Poland GA. Am J Hum Genet 2002: 70(2); 425-34.

A comparison of bayesian methods for haplotype reconstruction from population genotype data. Stephens M, Donnelly P. Am J Hum Genet 2003: 73(5); 1162-9.

Software resources:

haplo.score. Score Tests for Association of Traits with Haplotypes when Linkage Phase is Ambiguous. Rowland CM, Tines DE, Schaid DJ, Mayo Clinic. Accessed on 2006-01-13.

[Abstract] A suite of S-PLUS routines, referred to as "haplo.score", can be used to compute score statistics to test associations between haplotypes and a wide variety of traits, including binary, ordinal, quantitative, and Poisson. These methods assume that all subjects are unrelated and that haplotypes are ambiguous (due to unknown linkage phase of the genetic markers). The methods provide several different global and haplotype-specific tests for association, as well as provide adjustment for non-genetic covariates and computation of simulation p-values (which may be needed for sparse data).

Documentation for PHASE, version 2.1, June 2004. Stephens M, Smith NJ, Donnelly P, Li N, University of Washington. Accessed on 2006-01-13.

[My comments] This implements a Bayesian solution to the haplotype problem. More details about the software are at

There are at least four libraries in R that perform haplotype analysis:

There is also some software called statgene developed at the Mayo Clinic, but I could not find any details about this anywhere I looked on the web.

This work is licensed under a Creative Commons Attribution 3.0 United States License. It was written by Steve Simon and was last modified on 04/01/2010.

This page was written by Steve Simon while working at Children's Mercy Hospital. Although I do not hold the copyright for this material, I am reproducing it here as a service, as it is no longer available on the Children's Mercy Hospital website.