P.Mean >> Category >> Measuring agreement (created 2007-07-28).

There are several ways to calculate the degree of agreement between two variables that purport to measure the same thing. In addition to describing these measures, this category discusses how to assess reliability and validity, which is typically done by establishing a strong degree of agreement. Articles are arranged by date with the most recent entries at the top. You can find outside resources at the bottom of this page. Other entries about measuring agreement can be found in the measuring agreement page at the StATS website.


10. P.Mean: Reliable diagnosis of cataracts (created 2008-07-28). Can you help with this question? Cataracts of the eye may be difficult to diagnose, especially in the early stages. In a study of the reliability of their diagnoses, two physicians each examined the same 1,000 eyes, without knowing the other's diagnoses. Each physician found 100 eyes with cataracts. Does this mean that the diagnoses are reliable?
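
One way to see why matching totals are not enough is to compute Cohen's kappa for several possible overlaps between the two physicians. The sketch below is only an illustration and is not part of the original question: the function names and the overlap values are assumptions, because the question does not say how many of the 100 eyes the two physicians have in common.

```python
def cohen_kappa(both_yes, only_a, only_b, both_no):
    """Cohen's kappa for two raters making a yes/no diagnosis."""
    n = both_yes + only_a + only_b + both_no
    observed = (both_yes + both_no) / n          # proportion of eyes where the raters agree
    a_yes = (both_yes + only_a) / n              # rater A's rate of positive diagnoses
    b_yes = (both_yes + only_b) / n              # rater B's rate of positive diagnoses
    expected = a_yes * b_yes + (1 - a_yes) * (1 - b_yes)  # agreement expected by chance
    return (observed - expected) / (1 - expected)


def kappa_for_overlap(overlap, positives=100, total=1000):
    """Kappa when each rater flags `positives` eyes and `overlap` eyes are flagged by both.

    `overlap` is a hypothetical quantity; the question only gives the totals.
    """
    only_a = positives - overlap
    only_b = positives - overlap
    both_no = total - overlap - only_a - only_b
    return cohen_kappa(overlap, only_a, only_b, both_no)


if __name__ == "__main__":
    for overlap in (0, 10, 50, 100):
        print(overlap, kappa_for_overlap(overlap))
```

Under these assumptions, kappa equals 1 only if both physicians flag the same 100 eyes. It is about 0.44 with 50 shared eyes, essentially 0 with 10 shared eyes (the overlap you would expect by chance alone), and about -0.11 with no shared eyes. Identical totals are consistent with every one of these scenarios, so the totals by themselves say nothing about reliability.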

Outside resources:

Philip Schluter. A multivariate hierarchical Bayesian approach to measuring agreement in repeated measurement method comparison studies. BMC Medical Research Methodology. 2009;9(1):6. Description: This paper considers a Bayesian extension to the Bland-Altman chart to incorporate repeatability assessment as well as agreement among three or more methods. Abstract: "BACKGROUND: Assessing agreement in method comparison studies depends on two fundamentally important components; validity (the between method agreement) and reproducibility (the within method agreement). The Bland-Altman limits of agreement technique is one of the favoured approaches in medical literature for assessing between method validity. However, few researchers have adopted this approach for the assessment of both validity and reproducibility. This may be partly due to a lack of a flexible, easily implemented and readily available statistical machinery to analyse repeated measurement method comparison data. METHODS: Adopting the Bland-Altman framework, but using Bayesian methods, we present this statistical machinery. Two multivariate hierarchical Bayesian models are advocated, one which assumes that the underlying values for subjects remain static (exchangeable replicates) and one which assumes that the underlying values can change between repeated measurements (non-exchangeable replicates). RESULTS: We illustrate the salient advantages of these models using two separate datasets that have been previously analysed and presented; (i) assuming static underlying values analysed using both multivariate hierarchical Bayesian models, and (ii) assuming each subject's underlying value is continually changing quantity and analysed using the non-exchangeable replicate multivariate hierarchical Bayesian model. CONCLUSIONS: These easily implemented models allow for full parameter uncertainty, simultaneous method comparison, handle unbalanced or missing data, and provide estimates and credible regions for all the parameters of interest. Computer code for the analyses in also presented, provided in the freely available and currently cost free software package WinBUGS." [Accessed January 30, 2009]. Available at: http://www.biomedcentral.com/1471-2288/9/6.

Helena Chmura Kraemer. Correlation coefficients in medical research: from product moment correlation to the odds ratio. Stat Methods Med Res. 2006;15(6):525-45. Description: There are several measures of agreement (such as the phi coefficient, the point biserial correlation, and the tetrachoric correlation) that are used to show relationships when one or both variables are binary. This paper shows the interrelationships and the interpretation of these correlations and relates them to other measures not traditionally thought of as measures of correlation, such as the odds ratio. Abstract: "OBJECTIVE: Presentation of effect sizes that can be interpreted in terms of clinical or practical significance is currently urged whenever statistical significance (a 'p-value') is reported in research journals. However, which effect size and how to interpret it are not yet clearly delineated. The present focus is on effect sizes indicating strength of correlation, that is, effect sizes that describe the strength of monotonic association between two random variables X and Y in a population. METHODS: A logical structure of measures of association is traced, showing the interrelationships among the many measures of association. Advantages and disadvantages of each are discussed. CONCLUSIONS: Suggestions are made for the future use of measures of association in research to facilitate considerations of clinical significance, emphasizing distribution-free effect sizes such as the Spearman correlation coefficient and Kendall's coefficient of concordance for ordinal versus ordinal associations, weighted and intraclass kappa for binary versus binary associations and risk difference (RD) for binary versus ordinal association." [Accessed January 30, 2009]. Available at: http://www.ncbi.nlm.nih.gov/pubmed/17260922.

Robin Pokrzywinski, David Meads, Stephen McKenna, G Glendenning, Dennis Revicki. Development and psychometric assessment of the COPD and Asthma Sleep Impact Scale (CASIS). Health and Quality of Life Outcomes. 2009;7(1):98. Abstract: "BACKGROUND: Patients with respiratory disease experience disturbed sleep, but there is no widely accepted measure of sleep impairment due to respiratory disease. We developed and evaluated the psychometric performance of a patient-reported measure to assess the impact on sleep due to respiratory disease, the COPD and Asthma Sleep Impact Scale (CASIS). METHODS: Identification of the items forming the CASIS was guided by patient interviews and focus groups. An observational study involving patients from the US and UK was then conducted to assess the psychometric characteristics of the measure. RESULTS: Qualitative data from 162 patients were used to develop the CASIS (n=78 COPD; n=84 asthma). The observational study included 311 patients with COPD and 324 patients with asthma. The final seven items used in the CASIS were identified based on factor and item response theory analyses. Internal consistency was 0.90 (COPD) and 0.92 (asthma), and test-retest reliability was 0.84 (both groups). In the COPD sample, CASIS scores were significantly correlated with SGRQ scores (all p<0.0001) and differed significantly by patient-reported disease severity, exacerbation status, and overall health status (all p≤0.005). In the asthma sample, CASIS scores were significantly correlated with AQLQ scores (all p<0.0001) and differed significantly by clinician and patient-reported disease severity, exacerbation status, and overall health status (all p≤0.0005). CONCLUSIONS: The CASIS shows good internal consistency, test-retest reliability, and construct validity and may be useful in helping to understand the impact that COPD and asthma have on sleep outcomes." [Accessed December 15, 2009]. Available at: http://www.hqlo.com/content/7/1/98.

I. Griffin, N. Pang, J. Perring, R. Cooke. Knee-heel length measurement in healthy preterm infants. Arch Dis Child Fetal Neonatal Ed. 1999;81(1):F50–F55. Description: This article provides an illustrative example of how to use the coefficient of variation to measure agreement on a continuous trait among several raters. Abstract: "AIM: To examine the reproducibility of crown-heel length measurement; the precision and reproducibility of knee-heel length measurement; and the association between the two in healthy preterm infants. METHODS: Paired crown-heel and knee-heel lengths were measured on 172 occasions by three observers in 43 preterm infants between 205 and 458 days of postconceptional age. RESULTS: Crown-heel length (CHL) measurement was highly reproducible, with a coefficient of variation (CV) of 0.41%. Knee-heel length (KHL) measurement was relatively precise (CV 0.78%), but less reproducible (intra-observer CV 1.77%, inter-observer CV 2.11%), especially in larger infants. The association between KHL and CHL was not consistent and varied with age. KHL was a poor predictor of CHL, with a 95% predictive interval of ± 27.5 mm. CONCLUSIONS: KHL was less reproducible than CHL, especially in larger infants, and a poor predictor of CHL." [Accessed January 30, 2009]. Available at: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=1720961&rendertype=abstract.

Chris J Gibbons, Roger J Mills, Everard W Thornton, John Ealing, John D Mitchell, Pamela J Shaw, Kevin Talbot, Alan Tennant, Carolyn A Young. Rasch analysis of the Hospital Anxiety and Depression Scale (HADS) for use in motor neurone disease. Health and Quality of Life Outcomes. 2011;9(1):82. Abstract: "BACKGROUND: The Hospital Anxiety and Depression Scale (HADS) is commonly used to assess symptoms of anxiety and depression in motor neurone disease (MND). The measure has never been specifically validated for use within this population, despite questions raised about the scale's validity. This study seeks to analyse the construct validity of the HADS in MND by fitting its data to the Rasch model. METHODS: The scale was administered to 298 patients with MND. Scale assessment included model fit, differential item functioning (DIF), unidimensionality, local dependency and category threshold analysis. RESULTS: Rasch analyses were carried out on the HADS total score as well as depression and anxiety subscales (HADS-T, D and A respectively). After removing one item from both of the seven item scales, it was possible to produce modified HADS-A and HADS-D scales which fit the Rasch model. An 11-item higher-order HADS-T total scale was found to fit the Rasch model following the removal of one further item. CONCLUSION: Our results suggest that a modified HADS-A and HADS-D are unidimensional, free of DIF and have good fit to the Rasch model in this population. As such they are suitable for use in MND clinics or research. The use of the modified HADS-T as a higher-order measure of psychological distress was supported by our data. Revised cut-off points are given for the modified HADS-A and HADS-D subscales." [Accessed on October 11, 2011].

G. David Garson. Reliability Analysis: Statnotes, from North Carolina State University, Public Administration Program. Excerpt: "Researchers must demonstrate instruments are reliable since without reliability, research results using the instrument are not replicable, and replicability is fundamental to the scientific method. Reliability is the correlation of an item, scale, or instrument with a hypothetical one which truly measures what it is supposed to. Since the true instrument is not available, reliability is estimated in one of four ways: 1. Internal consistency: Estimation based on the correlation among the variables comprising the set (typically, Cronbach's alpha). 2. Split-half reliability: Estimation based on the correlation of two equivalent forms of the scale (typically, the Spearman-Brown coefficient). 3. Test-retest reliability: Estimation based on the correlation between two (or more) administrations of the same item, scale, or instrument for different times, locations, or populations, when the two administrations do not differ on other relevant variables (typically, the Spearman Brown coefficient). 4. Inter-rater reliability: Estimation based on the correlation of scores between/among two or more raters who rate the same item, scale, or instrument (typically, intraclass correlation, of which there are six types discussed below). These four reliability estimation methods are not necessarily mutually exclusive, nor need they lead to the same results. All reliability coefficients are forms of correlation coefficients, but there are multiple types discussed below, representing different meanings of reliability and more than one might be used in single research setting." [Accessed January 1, 2010]. Available at: http://faculty.chass.ncsu.edu/garson/PA765/reliab.htm.

Kilem Li Gwet. Research Papers on Inter-Rater Reliability Estimation. Excerpt: "Below are some downloadable research papers published by Dr. Gwet on Inter-Rater Reliability. They are all in PDF format." [Accessed July 7, 2010]. Available at: http://www.agreestat.com/research_papers.html.

EDF 5481 Methods of Educational Research. Guide 3: Reliability, Validity, Causality, and Experiments. Susan Carol Losh, Florida State University, September 11, 2001. Description: This webpage provides simple definitions of terms commonly used in educational research, such as construct validity, and discusses how to establish a causal relationship. URL: edf5481-01.fa01.fsu.edu/Guide3.html

The Tetrachoric and Polychoric Correlation Coefficients. John Uebersax. Excerpt: "This page describes the tetrachoric and polychoric correlation coefficients, explains their meaning and uses, gives examples and references, provides programs for their estimation, and discusses other available software. While discussion is primarily oriented to rater agreement problems, it is general enough to apply to most other uses of these statistics." This website was last verified in 2008. URL: ourworld.compuserve.com/homepages/jsuebersax/tetra.htm

All of the material above this paragraph is licensed under a Creative Commons Attribution 3.0 United States License. This page was written by Steve Simon and was last modified on 2010-07-07. The material below this paragraph links to my old website, StATS. Although I wrote all of the material listed below, my ex-employer, Children's Mercy Hospital, has claimed copyright ownership of this material. The brief excerpts shown here are included under the fair use provisions of U.S. copyright law.


9. Stats: Cohen's Kappa with small cell sizes (April 26, 2007). Someone on Edstat-L wrote in asking about using Cohen's Kappa when some of the cells have a small sample size.

8. Stats: What is an adequate sample size for establishing validity and reliability? (April 9, 2007). Someone from Mumbai, India wrote in asking whether a sample of 163 was sufficiently large for a study of reliability and validity. This was for a project that was already done, and this person was worried that someone would complain that 163 is too small.


7. Stats: What is construct validity? (March 8, 2006). Someone asked me to define face validity, criterion validity, and construct validity. That's a tall order. In general, validity means that a measurement that we take represents what we think it should. This is important to establish, because many times we think we are measuring one thing, but we are measuring something else entirely. It is important to remember that validity is a journey and not a goal. You never reach a place called the land of valid measurements. Instead, you gradually strengthen the evidence for validity, but there is no threshold that you cross where you can say, "We can now conclude that the measure is valid." Similarly, there is no region we can point to where we can say with confidence "We have not yet reached the point where we can say that the measure is valid."


6. Stats: Very low values from Cronbach's Alpha (July 19, 2005). Someone came to me with an analysis of a scale of 16 items, where the Cronbach's alpha computed for that scale was only 0.14. The first thing you should look for in this situation is whether some of the items on the scale are reverse scaled.
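
A minimal sketch of that check follows. The data are made up for illustration (six respondents, four items on an assumed 1 to 5 response scale, with the last item worded in the opposite direction), and the function names are mine rather than anything from the original analysis.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha; each row holds one respondent's item scores."""
    k = len(rows[0])
    item_variances = [variance([row[j] for row in rows]) for j in range(k)]
    total_variance = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


def reverse_score(rows, item, low=1, high=5):
    """Return a copy of the data with one item flipped on a low..high scale."""
    return [
        [(low + high - value) if j == item else value for j, value in enumerate(row)]
        for row in rows
    ]


# Hypothetical responses; the last item (index 3) is reverse worded.
responses = [
    [5, 4, 5, 1],
    [4, 4, 4, 2],
    [3, 3, 3, 3],
    [2, 2, 3, 4],
    [1, 2, 1, 5],
    [4, 5, 4, 2],
]

alpha_raw = cronbach_alpha(responses)                      # reverse-worded item left as is
alpha_fixed = cronbach_alpha(reverse_score(responses, 3))  # item 3 flipped before scoring

if __name__ == "__main__":
    print(alpha_raw, alpha_fixed)
```

With these made-up numbers, alpha is negative when the reverse-worded item is left alone and rises above 0.9 once that item is flipped. A real scale will rarely show such a dramatic swing, but the direction of the effect is the same.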

5. Stats: Measuring agreement (April 19, 2005). Someone reviewing a paper asked me about all the "weird statistics" being used in the paper, such as the Bland-Altman plot and Deming regression.


4. Stats: What's a good value for Cronbach's Alpha? (September 9, 2004). Someone sent me a question a month ago that I never got around to responding to. She asks, "What would you consider a Cronbach alpha of .60 to be in terms of 'label' (i.e., fair, poor, etc.)?"

3. Stats: Goodness of fit test (May 18, 2004). The chi-square test appears in a lot of different places. Some recent data on Astrology, published in the May newsletter of the Skeptic Society, offers an interesting opportunity to show one of these tests.

2. Stats: Reliability/Validity (January 13, 2004). Dear Professor Mean, How do I show that a measurement has validity and reliability? --- Meek Mark


1. Stats: Establishing validity and reliability (November 6, 2002). Dear Professor Mean, I need to establish validity and reliability of a new measurement. How do I do this? --- Flustered Fred


  1. Stats: What is a point biserial correlation?
  2. Stats: What is a correlation? (Pearson correlation)
  3. Stats: What is a Kappa coefficient? (Cohen's Kappa)
  4. Stats: What are odds?
  5. Stats: What is an odds ratio?
  6. Stats: What is a phi coefficient?
