StATS: Establishing validity and reliability (November 6, 2002)

Dear Professor Mean, I need to establish validity and reliability of a new measurement. How do I do this? --- Flustered Fred

Dear Flustered. Validity and reliability are two very dangerous words to use, because they mean different things to different people. Here are some issues to consider.

Face validity

Face validity is the extent to which your measurement appears valid to the general public. Does the measurement make intuitive sense? A closely related concept is content validity, the extent to which your measurement makes intuitive sense to a group of experts in the area.

Construct validity

Construct validity is establishing that you are measuring what you think you are measuring. You do this in two very different ways. First, you establish that your measure correlates well with what it should correlate well with (convergent validity). Second, you establish that it is uncorrelated with what it should be uncorrelated with (discriminant validity).
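
As a rough illustration, you could check both properties with a simple correlation coefficient. The sketch below is only that: the scores are made-up numbers for eight hypothetical subjects, the names new_scale, related_scale, and unrelated are invented for this example, and Pearson's r is just one of several coefficients you might choose.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / sqrt(sxx * syy)


# Hypothetical scores for eight subjects (illustrative values only).
new_scale = [12, 15, 9, 20, 17, 11, 14, 18]
related_scale = [30, 36, 25, 47, 41, 27, 33, 44]  # a measure it should track
unrelated = [5, 7, 6, 5, 7, 6, 7, 5]              # a measure it should not track

r_convergent = pearson_r(new_scale, related_scale)
r_discriminant = pearson_r(new_scale, unrelated)

print(f"convergent r   = {r_convergent:.2f}")
print(f"discriminant r = {r_discriminant:.2f}")
```

A large correlation with the related measure and a correlation near zero with the unrelated one would be consistent with construct validity, though how large is "large enough" depends on your field.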

A Plethora of Threats: A Mildly Amusing Guide for the Weary Student and Anyone Else Encountering the How To's and What If's of Construct Validity. Nicole M. Driebe. Accessed on November 4, 2002.

Criterion validity

Criterion validity is the extent to which your measurement correlates with a gold standard. You can establish this two ways. First, you can see if your measure predicts a future measure that indicates the same thing (predictive validity). Second, you can see if your measure correlates well with other concurrent measures which have been established as gold standards (concurrent validity).


Two large hospitals in the Netherlands studied four risk assessment scales for pressure ulcers (BMJ 2002; 325: 797 (12 October)). Four separate scales were applied to patients without pressure ulcers who were expected to stay at least five days in the hospital. These patients were then followed for up to 12 weeks to see if a pressure ulcer developed. The study showed that all four scales had poor predictive validity: they were unable to make accurate predictions of which patients would develop pressure ulcers. The authors noted that there had been minimal evaluation of these scales beyond expert opinion and literature review.
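
One common way to quantify predictive validity when the outcome is yes/no is to pick a cutoff on the scale and compare the resulting "at risk" classification with what actually happened during follow-up. The sketch below assumes that approach; it is not necessarily the analysis used in the BMJ study. The scores, outcomes, and cutoff are invented, and the convention that a higher score means higher risk is also an assumption (some real scales run the other way).

```python
def ratio(numerator, denominator):
    """Return numerator / denominator, or None when the denominator is zero."""
    return numerator / denominator if denominator else None


def predictive_summary(scores, developed_ulcer, cutoff):
    """Compare 'at risk' (score >= cutoff) with the outcome seen at follow-up."""
    tp = fp = fn = tn = 0
    for score, outcome in zip(scores, developed_ulcer):
        at_risk = score >= cutoff
        if at_risk and outcome:
            tp += 1   # flagged, and an ulcer developed
        elif at_risk and not outcome:
            fp += 1   # flagged, but no ulcer developed
        elif not at_risk and outcome:
            fn += 1   # not flagged, but an ulcer developed
        else:
            tn += 1   # not flagged, and no ulcer developed
    return {
        "sensitivity": ratio(tp, tp + fn),  # share of ulcer cases flagged
        "specificity": ratio(tn, tn + fp),  # share of non-cases not flagged
        "ppv": ratio(tp, tp + fp),          # share of flagged who got an ulcer
        "npv": ratio(tn, tn + fn),          # share of unflagged who stayed clear
    }


# Hypothetical data for ten patients (illustrative values only).
scores = [3, 8, 5, 9, 2, 7, 6, 4, 8, 1]
developed_ulcer = [False, True, False, False, False,
                   True, False, True, False, False]

summary = predictive_summary(scores, developed_ulcer, cutoff=6)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

A scale with poor predictive validity would show up here as low sensitivity, low specificity, or both, no matter which cutoff you try.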

Prospective cohort study of routine use of risk assessment scales for prediction of pressure ulcers. Lisette Schoonhoven, Jeen R E Haalboom, Mente T Bousema, Ale Algra, Diederick E Grobbee, Maria H Grypdonck, and Erik Buskens. BMJ 2002; 325: 797.

Further reading

  1. Asmussen, L., L. M. Olson, et al. (1999). "Reliability and validity of the Children's Health Survey for Asthma." Pediatrics 104(6): e71.
  2. Barry, D. (1996). "Differential recall bias and spurious associations in case/control studies." Statistics in Medicine 15(23): 2603-16.
  3. Bartko, J. J. (1966). "The intraclass correlation coefficient as a measure of reliability." Psychological Reports 19(1): 3-11.
  4. Bartko, J. J. (1976). "On Various Intraclass Correlation Reliability Coefficients." Psychological Bulletin 83(5): 762-765.
  5. Bartko, J. J. (1994). "Measures of agreement: a single procedure." Statistics in Medicine 13(5-7): 737-45.
  6. Beckett, M., M. Weinstein, et al. (2000). "Do health interview surveys yield reliable data on chronic illness among older respondents?" American Journal of Epidemiology 151(3): 315-23.
  7. Block, G. (1982). "A review of validations of dietary assessment methods." American Journal of Epidemiology 115(4): 492-505.
  8. Bulloch, B. and M. Tenenbein (2002). "Validation of 2 pain scales for use in the pediatric emergency department." Pediatrics 110(3): e33.
  9. Caan, B., M. Slattery, et al. (1998). "Comparison of the Block and the Willett self-administered semiquantitative food frequency questionnaires with an interviewer-administered dietary history." AJE 148(12): 1137-47.
  10. Carey, R. G. and J. H. Seibert (1993). "A patient survey system to measure quality improvement: questionnaire reliability and validity." Med Care 31(9): 834-45.
  11. Carroll, R. T. The Forer effect (a.k.a. the P.T. Barnum effect and subjective validation).
  12. Carroll, R. T. The Mozart Effect. The Mozart Effect is a term coined by Alfred A. Tomatis for the alleged increase in brain development that occurs in children under age 3 when they listen to the music of Wolfgang Amadeus Mozart.
  13. Carroll, R. T. Myers-Briggs Type Indicator®.
  14. Collins, S. L., R. A. Moore, et al. (1997). "The visual analogue pain intensity scale: what is moderate pain in millimetres?" Pain 72(1-2): 95-7.
  15. Coughlin, S. S. (1990). "Recall bias in epidemiologic studies." J Clin Epidemiol 43(1): 87-91.
  16. Cronbach, L. J. and P. E. Meehl (1955). "Construct Validity in Psychological Tests." Psychological Bulletin 52: 281-302.
  17. Crume, T. L., C. DiGuiseppi, et al. (2002). "Underascertainment of child maltreatment fatalities by death certificates, 1990-1998." Pediatrics 110(2 Pt 1): e18 (1 - 6).
  18. Day, N., N. McKeown, et al. (2001). "Epidemiological assessment of diet: a comparison of a 7-day diary with a food frequency questionnaire using urinary markers of nitrogen, potassium and sodium." Int J Epidemiol 30(2): 309-17.
  19. Driebe, N. M. A Plethora of Threats: A Mildly Amusing Guide for the Weary Student and Anyone Else Encountering the How To's and What If's of Construct Validity.
  20. Ellman, M. S., C. M. Viscoli, et al. (1997). "A new index of prognostic severity for chronic asthma." Chest 112(3): 582-90.
  21. Grace, J. Research Fables from the Sisters Grinn, No. 2. Snow White and the Seven Threats to Validity.
  22. Gray-Donald, K., J. O'Loughlin, et al. (1997). "Validation of a short telephone administered questionnaire to evaluate dietary interventions in low income communities in Montreal, Canada." Journal of Epidemiology and Community Health 51(3): 326-331.
  23. Hanley, J., A. Capewell, et al. (2001). "Validity study of the severity index, a simple measure of urinary incontinence in women [In Process Citation]." British Medical Journal 322(7294): 1096-7.
  24. Jacobs, J., L. M. Jimenez, et al. (1994). "Treatment of acute childhood diarrhea with homeopathic medicine: a randomized clinical trial in Nicaragua." Pediatrics 93(5): 719-25.
  25. Jacobson, S. W., L. M. Chiodo, et al. (2002). "Validity of Maternal Report of Prenatal Alcohol, Cocaine, and Smoking in Relation to Neurobehavioral Outcome." Pediatrics 109(5): 815-825.
  26. Kipnis, V., D. Midthune, et al. (2001). "Empirical Evidence of Correlated Biases in Dietary Assessment Instruments and Its Implications." Am. J. Epidemiol. 153(4): 394-403.
  27. Labouvie, E., M. E. Bates, et al. (1997). "Age of First Use: Its Reliability and Predictive Utility." Journal of Studies on Alcohol 58(6): 638-643.
  28. Leffondre, K., M. Abrahamowicz, et al. (2002). "Modeling Smoking History: A Comparison of Different Approaches." Am. J. Epidemiol. 156(9): 813-823.
  29. Lemaitre, R. N., I. B. King, et al. (1998). "Assessment of trans-fatty acid intake with a food frequency questionnaire and validation with adipose tissue levels of trans-fatty acids." Am J Epidemiol 148(11): 1085-93.
  30. Levine, D. (1994). "True scores, error, reliability, and unit of analysis in environment and behavior research." Environment and Behavior 26(2): 261-92.
  31. Levine, D. W., D. F. Kripke, et al. (2003). "Reliability and validity of the Women's Health Initiative Insomnia Rating Scale." Psychol Assess 15(2): 137-48.
  32. Lewin, R. J., D. R. Thompson, et al. (2002). "Validation of the Cardiovascular Limitations and Symptoms Profile (CLASP) in chronic stable angina." J Cardiopulm Rehabil 22(3): 184-91.
  33. Lilienfeld, S. O. (2001). "What's Wrong with This Picture? (Inkblot Test)." Scientific American: 81-87.
  34. Lilienfeld, S. O., J. M. Wood, et al. (2001). "What's Wrong with This Picture?" Scientific American: 80-87.
  35. Malviya, S., T. Voepel-Lewis, et al. (2002). "Depth of sedation in children undergoing computed tomography: validity and reliability of the University of Michigan Sedation Scale (UMSS)." Br J Anaesth 88(2): 241-5.
  36. Matheson, D. M., K. A. Hanson, et al. (2002). "Validity of Children's Food Portion Estimates: A Comparison of 2 Measurement Aids." Arch Pediatr Adolesc Med 156(9): 867-71.
  37. Moussa, M. A., M. Z. Shafie, et al. (1990). "Reliability of death certificate diagnoses." J Clin Epidemiol 43(12): 1285-95.
  38. Muller, C. (2000). "Rationale, interpretation, validation, and uses of sperm function tests." Journal of Andrology 21(1): 10-30.
  39. Nekolaichuk, C. L., E. Bruera, et al. (1999). "A comparison of patient and proxy symptom assessments in advanced cancer patients." Palliat Med 13(4): 311-23.
  40. No authors listed (1993). "The CRIB (clinical risk index for babies) score: a tool for assessing initial neonatal risk and comparing performance of neonatal intensive care units. The International Neonatal Network." Lancet 342(8865): 193-8.
  41. Penetar, D., U. McCann, et al. (1993). "Caffeine reversal of sleep deprivation effects on alertness and mood." Psychopharmacology (Berl) 112(2-3): 359-65.
  42. Richardson, D. K., J. E. Gray, et al. (1993). "Score for Neonatal Acute Physiology: a physiologic severity index for neonatal intensive care." Pediatrics 91(3): 617-23.
  43. Sanders, C., M. Egger, et al. (1998). "Reporting on quality of life in randomised controlled trials: bibliographic study." BMJ 317(7167): 1191-4.
  44. Schoonhoven, L., J. R. Haalboom, et al. (2002). "Prospective cohort study of routine use of risk assessment scales for prediction of pressure ulcers." BMJ 325(7368): 797.
  46. Shrout, P. E. and J. L. Fleiss (1979). "Intraclass Correlations: Uses in Assessing Rater Reliability." Psychological Bulletin 86(2): 420-28.
  47. Simpson, D. and R. Fincher (1999). "Making a Case for the Teaching Scholar." Academic Medicine 74(12): 28-31.
  48. Steele, K. M., K. E. Bass, et al. (1999). "The Mystery of the Mozart Effect: Failure to Replicate." Psychological Science 10(4): 366-369.
  49. Sunmola, A. M. (2001). "Developing a scale for measuring the barriers to condom use in Nigeria." Bull World Health Organ 79(10): 926-32.
      Taylor, J. K. (1983). "Validation of Analytical Methods." Analytical Chemistry.
  50. The Royal Windsor Society for Nursing Research. Instrument Validity.
  51. Thomas, E., D. Studdert, et al. (2002). "The reliability of medical record review for estimating adverse event rates." Ann Intern Med 136(11): 812-816.
  52. Thompson, F., H. Metzner, et al. (1990). "Characteristics of individuals and long term reproducibility of dietary reports: the Tecumseh Diet Methodology Study." J Clin Epidemiol 43(11): 1169-78.
  53. Trochim, W. M. K. Types of Reliability.
    Excerpt: You learned in the Theory of Reliability that it's not possible to calculate reliability exactly. Instead, we have to estimate reliability, and this is always an imperfect endeavor. Here, I want to introduce the major reliability estimators and talk about their strengths and weaknesses.
  54. Walsh, D. A. and D. A. Gentile (2001). "A validity test of movie, television, and video-game ratings." Pediatrics 107(6): 1302-8.
  55. Wells, A., P. English, et al. (1998). "Misclassification rates for current smokers misclassified as nonsmokers." American Journal of Public Health 88(10): 1503-09.
  56. Werler, M. M., B. R. Pober, et al. (1989). "Reporting accuracy among mothers of malformed and nonmalformed infants." Am J Epidemiol 129(2): 415-21.
  58. Wirfält, A., R. Jeffery, et al. (1998). "Comparison of food frequency questionnaires: the reduced Block and Willett questionnaires differ in ranking on nutrient intakes." AJE 148(12): 1148-56.
  59. Wright, S. P. (1999). "Reporting on quality of life in RCTs." British Medical Journal 318(7191): 1142.
  60. Instrument Validity. The Royal Windsor Society for Nursing Research. Accessed on November 4, 2002.
  61. The Examination Chapter of the Neurology section of the Family Practice Notebook has some interesting tests like the mini-mental state exam that could serve as good examples of developing validity.
  69. The Cochrane Group defines internal validity as "the extent to which the observed effects are true for the people in a study" --
  70. J Psychosom Res 2002 Feb;52(2):69-77. The validity of the Hospital Anxiety and Depression Scale. An updated literature review. Bjelland I, Dahl AA, Haug TT, Neckelmann D.
  71. J Cardiopulm Rehabil 2002 May-Jun;22(3):184-91. Validation of the Cardiovascular Limitations and Symptoms Profile (CLASP) in chronic stable angina. Lewin RJ, Thompson DR, Martin CR, Stuckey N, Devlen J, Michaelson S, Maguire P.
      Psychosomatics 2001 Sep-Oct;42(5):423-8. Sensitivity and specificity of observer and self-report questionnaires in major and minor depression following myocardial infarction. Strik JJ, Honig A, Lousberg R, Denollet J.
  72. Disabil Rehabil 2001 Nov 10;23(16):737-44. Screening for anxiety, depressive and somatoform disorders in rehabilitation--validity of HADS and GHQ-12 in patients with musculoskeletal disease. Harter M, Reuter K, Gross-Hardt K, Bengel J.

This page was written by Steve Simon while working at Children's Mercy Hospital. Although I do not hold the copyright for this material, I am reproducing it here as a service, as it is no longer available on the Children's Mercy Hospital website.