Stats: Guidelines for reliability/validity models (January 13, 2004)

Dear Professor Mean, How do I show that a measurement has validity and reliability? Meek Mark

Dear Meek,

Just list Professor Mean as one of your co-authors. My reputation will establish validity and reliability all by itself.

Short Answer

The terms validity and reliability mean different things to different people, so be sure you understand exactly what your particular discipline requires. As a general rule, you establish reliability by showing a good level of correlation, concordance, or agreement between multiple assessments of the same quantity. For example, you could ask two people to rate the same patient and then compare the results. Or, if the condition is stable over time, you might take your measurement at two different times. You establish validity by comparing your measurement to an external standard. The hard part is coming up with an appropriate standard.

Estimating reliability from repeated measurements by the same observer.

The simplest way to assess reliability is to ask an observer to measure the same patient twice. Common sense tells you to be careful here. Don't schedule the two measurements in such a way that the first measurement would influence the second measurement. Make sure the measurements are far enough apart in time that the observer is unlikely to remember at the time of the second measurement what the first measurement was. On the other hand, don't schedule the measurements so far apart that the patient has had enough time to substantially improve or worsen on the condition you are trying to measure.

The estimate of reliability in this situation uses a one factor random effects model. This model works well when the repeated measurements are more or less exchangeable.
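For reference, the one factor random effects model can be written out as follows. The symbols are notation chosen here for illustration and do not come from SPSS: Y_ij is the j-th measurement on patient i, mu is the overall mean, s_i is the random effect for patient i, and e_ij is the measurement error.

```latex
Y_{ij} = \mu + s_i + e_{ij}, \qquad
s_i \sim N\!\left(0, \sigma^2_{\mathrm{subj}}\right), \qquad
e_{ij} \sim N\!\left(0, \sigma^2_{\mathrm{error}}\right)
```

Nothing in the model depends on whether a measurement was the first or the second one taken, which is what "more or less exchangeable" means here.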

Generally, you will find it easiest to get two measurements on the same patient, but if circumstances allow for a third or fourth measurement, the one factor random effects model allows you to use those data. You can also have a different number of measurements on different patients without any problem. The formulas get a bit more complex, but statistical software like SPSS does all the work for you.

Here's some data from a chiropractic assessment of 19 individual patients. The assessment was the distance, in millimeters, that a particular condition extended along the spine. The same chiropractor made this measurement twice on each patient. Here's a listing of the data (turned sideways to save space).

Here's a simple graph of the data.

Notice that some measurements are very consistent (patient 11 has measurements that are only 6 mm apart) and that others are not so consistent (patient 16 has measurements that are 120 mm apart, and patient 17 has measurements that are 103 mm apart).

To quantify the degree of reliability of these measurements, select ANALYZE | GENERAL LINEAR MODEL | VARIANCE COMPONENTS from the SPSS menu.

The measurement itself is the dependent variable and the code for each individual patient is the random factor in this model. There are several options for computing variance components.

The ANOVA option is the easiest method, but can sometimes lead to negative variance estimates, which are a bit difficult to interpret. In practice, a negative estimate is unlikely to occur in a reliability situation unless the measurement totally lacks any level of reliability.
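To see roughly what an ANOVA-type estimate does, here is a minimal Python sketch of the method of moments estimates for a one factor random effects model. The function name and the two tiny data sets are hypothetical, made up purely for illustration (they are not the chiropractic data), and the sketch is not claimed to reproduce SPSS output digit for digit.

```python
def one_way_variance_components(groups):
    """Method of moments (ANOVA) estimates for a one factor random effects model.

    groups: list of lists, one inner list of repeated measurements per patient.
            Patients may have different numbers of measurements.
    Returns (var_subj, var_error). var_subj can come out negative.
    """
    k = len(groups)                          # number of patients
    n_total = sum(len(g) for g in groups)    # total number of measurements
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]

    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)

    # "Average" number of measurements per patient; equals n for balanced data.
    n0 = (n_total - sum(len(g) ** 2 for g in groups) / n_total) / (k - 1)

    var_error = ms_within
    var_subj = (ms_between - ms_within) / n0
    return var_subj, var_error


# Hypothetical data: repeat measurements agree closely within each patient.
consistent = [[10, 12], [20, 22], [30, 28]]
print(one_way_variance_components(consistent))

# Hypothetical data: repeat measurements disagree far more than patients differ,
# so the between-patient variance estimate comes out negative.
inconsistent = [[10, 30], [12, 28], [11, 29]]
print(one_way_variance_components(inconsistent))
```

The second data set shows how a negative estimate arises: the within-patient mean square is larger than the between-patient mean square, so the subtraction goes below zero.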

SPSS does not directly produce an estimate of the intraclass correlation coefficient, but you can compute it from the variance components as Var(subj) / (Var(subj) + Var(Error)). In this case it works out to 1738 / (1738 + 1150) = 0.60.
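The arithmetic is easy to check in a few lines of Python. The function name is hypothetical; the two variance components are the ones quoted above.

```python
def icc_one_factor(var_subj, var_error):
    """Share of the total variance that is due to differences between subjects."""
    return var_subj / (var_subj + var_error)


icc = icc_one_factor(var_subj=1738, var_error=1150)
print(round(icc, 2))  # 0.6
```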

As a general rule of thumb (Shoukri and Edge 1996), a reliability coefficient (r) is

Excellent if r is larger than 0.75,
Good if r is between 0.4 and 0.75,
Poor if r is less than 0.4.

So this measurement has good reliability, but not excellent.
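If you have several coefficients to label, the rule of thumb is simple to encode. This is only a sketch: the function name is hypothetical, and treating a value of exactly 0.4 or 0.75 as "good" is an assumption, since the rule as stated does not say where the boundaries belong.

```python
def rate_reliability(r):
    """Label a reliability coefficient using the rule of thumb above."""
    if r > 0.75:
        return "excellent"
    if r >= 0.4:   # assumption: the boundary values 0.4 and 0.75 count as "good"
        return "good"
    return "poor"


print(rate_reliability(0.60))  # good
```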

On your own

Three additional chiropractors evaluated patients on two separate occasions. Here are the data.

Estimate a separate reliability coefficient for each of the three chiropractors.

Estimating reliability from measurements by multiple observers.

The data shown above is interesting because you can reorganize it to represent another common way of assessing reliability: having multiple observers each provide a measurement for every patient.

Here is the data.

Here is a graph of this data.

This data calls for a two factor random effects model, since you have to account for variability between raters as well as variability between subjects.
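Written out with main effects only, the two factor model looks like this. As before, the symbols are notation chosen here for illustration: Y_ij is the measurement of patient i by rater j, s_i is the random effect for the patient, r_j is the random effect for the rater, and e_ij is the leftover error.

```latex
Y_{ij} = \mu + s_i + r_j + e_{ij}, \qquad
s_i \sim N\!\left(0, \sigma^2_{\mathrm{subj}}\right), \quad
r_j \sim N\!\left(0, \sigma^2_{\mathrm{ev}}\right), \quad
e_{ij} \sim N\!\left(0, \sigma^2_{\mathrm{error}}\right)
```

There is no patient by rater interaction term in this version of the model.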

You have to specify a model without an interaction, because there is not enough data to estimate this interaction. Click on the MODEL button to do this.

Specify a CUSTOM model, place the SUBJ and EV variables in the model, and do not specify an interaction. When you have done this, click on the CONTINUE button and then on the OK button in the previous dialog box.
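For readers who want to see the calculation behind a model like this, here is a minimal Python sketch of method of moments estimates for a two factor random effects model without an interaction. It assumes a complete table with exactly one measurement per patient and rater combination. The function name and the small table are hypothetical (not the chiropractic data), and the sketch is not claimed to match SPSS output exactly.

```python
def two_way_variance_components(table):
    """Variance components for subjects and raters, main effects only.

    table[i][j] is the measurement of subject i by rater j. The table must be
    complete, with exactly one measurement in every cell.
    """
    n = len(table)       # number of subjects
    k = len(table[0])    # number of raters
    grand = sum(sum(row) for row in table) / (n * k)
    row_means = [sum(row) / k for row in table]
    col_means = [sum(row[j] for row in table) / n for j in range(k)]

    ms_subj = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    ms_ev = n * sum((m - grand) ** 2 for m in col_means) / (k - 1)
    ss_error = sum(
        (table[i][j] - row_means[i] - col_means[j] + grand) ** 2
        for i in range(n)
        for j in range(k)
    )
    ms_error = ss_error / ((n - 1) * (k - 1))

    return {
        "subj": (ms_subj - ms_error) / k,
        "ev": (ms_ev - ms_error) / n,
        "error": ms_error,
    }


# Hypothetical table: 3 subjects (rows) measured by 2 raters (columns).
example = [[10, 12], [20, 23], [30, 31]]
print(two_way_variance_components(example))
```

With only one measurement per cell, the interaction and the error cannot be separated, so whatever is left after removing the subject and rater effects is all treated as error.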

The output from SPSS gives the following three variance components.

The intraclass correlation is computed as Var(subj) / (Var(subj) + Var(ev) + Var(Error)). In this example, it would be 1538 / (1538 + 170 + 1943) = 0.42.
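Here is the same calculation as a short Python check. The function name is hypothetical; the three variance components are the ones quoted above.

```python
def icc_two_factor(var_subj, var_ev, var_error):
    """Subject variance as a share of subject + rater + error variance."""
    return var_subj / (var_subj + var_ev + var_error)


icc = icc_two_factor(var_subj=1538, var_ev=170, var_error=1943)
print(round(icc, 2))  # 0.42
```

Because the rater variance sits in the denominator, systematic differences between raters lower this coefficient.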

On your own

Here is the other half of the chiropractic reliability data. Calculate the reliability among these four observers.

Further reading

Age of First Use: Its Reliability and Predictive Utility. Labouvie E, Bates ME, Pandina RJ. Journal of Studies on Alcohol 1997: 58(6); 638-643.

Assessment of trans-fatty acid intake with a food frequency questionnaire and validation with adipose tissue levels of trans-fatty acids. Lemaitre RN, King IB, Patterson RE, Psaty BM, Kestin M, Heckbert SR. Am J Epidemiol 1998: 148(11); 1085-93.

Caffeine reversal of sleep deprivation effects on alertness and mood. Penetar D, McCann U, Thorne D, Kamimori G, Galinski C, Sing H, Thomas M, Belenky G. Psychopharmacology (Berl) 1993: 112(2-3); 359-65. [Medline]

Characteristics of individuals and long term reproducibility of dietary reports: the Tecumseh Diet Methodology Study. Thompson F, Metzner H, Lamphiear D, Hawthorne V. J Clin Epidemiol 1990: 43(11); 1169-78.

Clinician agreement on physical findings in child sexual abuse cases. Sinal SH, Lawless MR, Rainey DY, Everett VD, Runyan DK, Frothingham T, Herman-Giddens M, St Claire K. Arch Pediatr Adolesc Med 1997: 151(5); 497-501.

Comparison of food frequency questionnaires: the reduced block and Willett questionnaires differ in ranking on nutrient intakes. Wirfait A, Jeffery R, Elmer P. AJE 1998: 148(12); 1148-56.

A comparison of patient and proxy symptom assessments in advanced cancer patients. Nekolaichuk CL, Bruera E, Spachynski K, MacEachern T, Hanson J, Maguire TO. Palliat Med 1999: 13(4); 311-23. [Medline]

Comparison of the Block and the Willett self-administered semiquantitative food frequency questionnaires with an interviewer-administered dietary history. Caan B, Slattery M, Potter J, Quesenberry CJ, Coates A, Schaffer D. AJE 1998: 148(12); 1137-47.

The concordance correlation coefficient estimated through variance components [PDF]. Carrasco J, Jover J. Accessed on 2003-12-1. www.udc.es/dep/mate/biometria2003/Archivos/ot76.pdf

Construct Validity in Psychological Tests. Cronbach LJ. Psychological Bulletin 1955: 52; 281-302.

Construct Validity of the MiniClinical Evaluation Exercise (MiniCEX). Holmboe ES, Huot S, Chung J, Norcini J, Hawkins RE. Acad Med 2003: 78(8); 826-830. [Medline] [Abstract] [Full text] [PDF]

The CRIB (clinical risk index for babies) score: a tool for assessing initial neonatal risk and comparing performance of neonatal intensive care units. The International Neonatal Network. No authors listed. Lancet 1993: 342(8865); 193-8.

Depth of sedation in children undergoing computed tomography: validity and reliability of the University of Michigan Sedation Scale (UMSS). Malviya S, Voepel-Lewis T, Tait AR, Merkel S, Tremper K, Naughton N. Br J Anaesth 2002: 88(2); 241-5. [Medline]

Developing a scale for measuring the barriers to condom use in Nigeria. Sunmola AM. Bull World Health Organ 2001: 79(10); 926-32.

Differential recall bias and spurious associations in case/control studies. Barry D. Statistics in Medicine 1996: 15(23); 2603-16. [Medline]

Do health interview surveys yield reliable data on chronic illness among older respondents? Beckett M, Weinstein M, Goldman N, Yu-Hsuan L. American Journal of Epidemiology 2000: 151(3); 315-23.

Empirical Evidence of Correlated Biases in Dietary Assessment Instruments and Its Implications. Kipnis V, Midthune D, Freedman LS, Bingham S, Schatzkin A, Subar A, Carroll RJ. Am. J. Epidemiol. 2001: 153(4); 394-403.

Epidemiological assessment of diet: a comparison of a 7-day diary with a food frequency questionnaire using urinary markers of nitrogen, potassium and sodium. Day N, McKeown N, Wong M, Welch A, Bingham S. Int J Epidemiol 2001: 30(2); 309-17.

Evaluating Novel Cardiovascular Risk Factors: Can We Better Predict Heart Attacks? Ridker PM. Annals of Internal Medicine 1999: 130(11); 933-937.

Examination of instruments used to rate quality of health information on the internet: chronicle of a voyage with an unclear destination. Gagliardi A, Jadad AR. Bmj 2002: 324(7337); 569-73. [Medline] [Abstract] [Full text] [PDF]

The Forer effect (a.k.a. the P.T. Barnum effect and subjective validation). Carroll RT, The Skeptic's Dictionary. Accessed on 2003-03-10. www.skepdic.com/myersb.html

Forming inferences about some intraclass correlation coefficients. McGraw KO, Wong SP. Psychological Methods 1996: 1; 30-46 (error corrections: 390).

Genetic test evaluation: information needs of clinicians, policy makers, and the public. Burke W, Atkins D, Gwinn M, Guttmacher A, Haddow J, Lau J, Palomaki G, Press N, Richards CS, Wideroff L, Wiesner GL. Am J Epidemiol 2002: 156(4); 311-8. [Medline]

Instrument Validity. The Royal Windsor Society for Nursing Research. Accessed on 2003-www.kelcom.igs.net/~nhodgins/instrument_validity.html

The intraclass correlation coefficient as a measure of reliability. Bartko JJ. Psychological Reports 1966: 19(1); 3-11. [Medline]

Intraclass Correlations: Uses in Assessing Rater Reliability. Shrout PE, Fleiss JL. Psychological Bulletin 1979: 86(2); 420-28.

Measures of agreement: a single procedure. Bartko JJ. Statistics in Medicine 1994: 13(5-7); 737-45. [Medline]

Misclassification rates for current smokers misclassified as nonsmokers. Wells A, English P, Posner S, Wagenknecht L, Perez-Stable E. American Journal of Public Health 1998: 88(10); 1503-09.

Modeling Smoking History: A Comparison of Different Approaches. Leffondre K, Abrahamowicz M, Siemiatycki J, Rachet B. Am. J. Epidemiol. 2002: 156(9); 813-823.

The Mozart Effect. Carroll RT, The Skeptic's Dictionary. Accessed on 2003-06-09. skepdic.com/mozart.html

Myers-Briggs Type Indicator. Carroll RT, The Skeptic's Dictionary. Accessed on 2003-03-10. www.skepdic.com/myersb.html

The Mystery of the Mozart Effect: Failure to Replicate. Steele KM, Bass KE, Crook MD. Psychological Science 1999: 10(4); 366-369. [PDF]

A new index of prognostic severity for chronic asthma. Ellman MS, Viscoli CM, Sears MR, Taylor DR, Beckett WS, Horwitz RI. Chest 1997: 112(3); 582-90. [Medline]

On Various Intraclass Correlation Reliability Coefficients. Bartko JJ. Psychological Bulletin 1976: 83(5); 762-765.

A patient survey system to measure quality improvement: questionnaire reliability and validity. Carey RG, Seibert JH. Med Care 1993: 31(9); 834-45.

Patterns of psychiatric morbidity in a genito-urinary clinic. A validation of the Hospital Anxiety Depression scale (HAD). Barczak P, Kane N, Andrews S, Congdon AM, Clay JC, Betts T. Br J Psychiatry 1988: 152; 698-700. [Medline]

A Plethora of Threats: A Mildly Amusing Guide for the Weary Student and Anyone Else Encountering the How To's and What If's of Construct Validity. Driebe NM. Accessed on 2003-09-17. trochim.human.cornell.edu/tutorial/driebe/tweb1.htm

Problems in using the hospital anxiety and depression scale for screening patients in general practice. Dowell AC, Biran LA. Br J Gen Pract 1990: 40(330); 27-8. [Medline]

Prospective cohort study of routine use of risk assessment scales for prediction of pressure ulcers. Schoonhoven L, Haalboom JR, Bousema MT, Algra A, Grobbee DE, Grypdonck MH, Buskens E. Bmj 2002: 325(7368); 797. [Medline]

Rating health information on the Internet: navigating to knowledge or to Babel? Jadad AR, Gagliardi A. Jama 1998: 279(8); 611-4. [Medline] [Abstract] [Full text] [PDF]

Rationale, interpretation, validation, and uses of sperm function tests. Muller C. Journal of Andrology 2000: 21(1); 10-30. [Medline]

Reliability and validity of the Children's Health Survey for Asthma. Asmussen L, Olson LM, Grant EN, Fagan J, Weiss KB. Pediatrics 1999: 104(6); e71. [Medline]

Reliability and validity of the Women's Health Initiative Insomnia Rating Scale. Levine DW, Kripke DF, Kaplan RM, Lewis MA, Naughton MJ, Bowen DJ, Shumaker SA. Psychol Assess 2003: 15(2); 137-48. [Medline]

Reliability and Validity: Meaning and Measurement [PDF]. Lewis RJ. Annual Meeting of the Society for Academic Emergency Medicine. Accessed on 2004-01-12. www.ambpeds.org/ReliabilityandValidity.pdf

Reliability of death certificate diagnoses. Moussa MA, Shafie MZ, Khogali MM, el-Sayed AM, Sugathan TN, Cherian G, Abdel-Khalik AZ, Garada MT, Verma D. J Clin Epidemiol 1990: 43(12); 1285-95. [Medline]

The reliability of medical record review for estimating adverse event rates. Thomas E, Studdert D, Brennan T. Ann Intern Med 2002: 136(11); 812-816. [Medline]

Reporting accuracy among mothers of malformed and nonmalformed infants. Werler MM, Pober BR, Nelson K, Holmes LB. Am J Epidemiol 1989: 129(2); 415-21. [Medline]

Reporting on quality of life in randomised controlled trials: bibliographic study. Sanders C, Egger M, Donovan J, Tallon D, Frankel S. Bmj 1998: 317(7167); 1191-4. [Medline] [Abstract] [Full text] [PDF]

Reporting on quality of life in RCTs. Wright SP. British Medical Journal 1999: 318(7191); 1142. [Full text]

Research Fables from the Sisters Grinn, No. 2. Snow White and the Seven Threats to Validity. Grace J, University of Rochester School of Nursing. Accessed on 2003-05-27. www.urmc.rochester.edu/SON/Fables/snowht.htm

A review of validations of dietary assessment methods. Block G. American Journal of Epidemiology 1982: 115(4); 492-505. [Medline]

Sample Size Requirements for Precise Estimates of Reliability, Generalizability, and Validity Coefficients. Charter RA. Journal of Clinical and Experimental Neuropsychology 1999: 21(4); 559-566.

Score for Neonatal Acute Physiology: a physiologic severity index for neonatal intensive care. Richardson DK, Gray JE, McCormick MC, Workman K, Goldmann DA. Pediatrics 1993: 91(3); 617-23.

Technical Support - Statistical Macro Library. SPSS Inc. Accessed on 2003-11-06. www.spss.com/tech/stat/macros/Iccsf.htm

Treatment of acute childhood diarrhea with homeopathic medicine: a randomized clinical trial in Nicaragua. Jacobs J, Jimenez LM, Gloyd SS, Gale JL, Crothers D. Pediatrics 1994: 93(5); 719-25.

True scores, error, reliability, and unit of analysis in environment and behavior research. Levine D. Environment and Behavior 1994: 26(2); 261-92.

Types of Reliability. Trochim WMK. Accessed on 2003-trochim.human.cornell.edu/kb/reltypes.htm

Underascertainment of child maltreatment fatalities by death certificates, 1990-1998. Crume TL, DiGuiseppi C, Byers T, Sirotnak AP, Garrett CJ. Pediatrics 2002: 110(2 Pt 1); e18 (1 - 6). [Medline] [Abstract] [Full text] [PDF]

Validation of 2 pain scales for use in the pediatric emergency department. Bulloch B, Tenenbein M. Pediatrics 2002: 110(3); e33. [Medline]

Validation of a short telephone administered questionnaire to evaluate dietary interventions in low income communities in Montreal, Canada. Gray-Donald K, O'Loughlin J, Richard L, Paradis G. Journal of Epidemiology and Community Health 1997: 51(3); 326-331.

Validation of an Autoscoring Algorithm to Detect Obstructive Sleep Apnea [PDF]. Coyle M, Carter G, Horsager A, Mendelson W. Accessed on 2003-12-01. pdf.vivometrics.com/vivo/APSSPoster061003.pdf

Validation of Analytical Methods. Taylor JK. Analytical Chemistry 1983:

Validation of the Cardiovascular Limitations and Symptoms Profile (CLASP) in chronic stable angina. Lewin RJ, Thompson DR, Martin CR, Stuckey N, Devlen J, Michaelson S, Maguire P. J Cardiopulm Rehabil 2002: 22(3); 184-91. [Medline]

Validity of Children's Food Portion Estimates: A Comparison of 2 Measurement Aids. Matheson DM, Hanson KA, McDonald TE, Robinson TN. Arch Pediatr Adolesc Med 2002: 156(9); 867-71. [Medline]

Validity of Maternal Report of Prenatal Alcohol, Cocaine, and Smoking in Relation to Neurobehavioral Outcome. Jacobson SW, Chiodo LM, Sokol RJ, Jacobson JL. Pediatrics 2002: 109(5); 815-825. [Abstract]

The validity of the Hospital Anxiety and Depression Scale. An updated literature review. Bjelland I, Dahl AA, Haug TT, Neckelmann D. J Psychosom Res 2002: 52(2); 69-77. [Medline]

Validity study of the severity index, a simple measure of urinary incontinence in women [In Process Citation]. Hanley J, Capewell A, Hagen S. British Medical Journal 2001: 322(7294); 1096-7.

A validity test of movie, television, and video-game ratings. Walsh DA, Gentile DA. Pediatrics 2001: 107(6); 1302-8.

The visual analogue pain intensity scale: what is moderate pain in millimetres? Collins SL, Moore RA, McQuay HJ. Pain 1997: 72(1-2); 95-7.

What's Wrong with This Picture? Lilienfeld SO, Wood JM, Garb HN. Scientific American 2001: 284(5); 80-87. [Medline] [PDF]