StATS: Establishing validity and reliability (November 6, 2002)
Dear Professor Mean, I need to establish validity and reliability of a
new measurement. How do I do this? --- Flustered Fred
Dear Flustered. Validity and reliability are two very dangerous words to
use, because they mean different things to different people. Here are some
issues to consider.
Face validity is the extent to which your measurement appears valid to the
general public. Does the measurement make intuitive sense? A closely related
concept is content validity, the extent to which your measurement makes
intuitive sense to a group of experts in the area.
Construct validity is establishing that you are measuring what you think
you are measuring. You do this in two very different ways. First, you
establish that your measure correlates well with what it should correlate
well with (convergent validity). Second, you establish that it is
uncorrelated with what it should be uncorrelated with (discriminant validity).
A Plethora of Threats: A Mildly Amusing Guide for the Weary Student and
Anyone Else Encountering the How To's and What If's of Construct Validity.
Nicole M. Driebe. Accessed on November 4, 2002.
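Here is a minimal sketch of how you might check convergent and discriminant validity with simple correlations. The scores below are made up purely for illustration, and the names (new_scale, related_measure, unrelated_measure) are hypothetical. A real study would need far more subjects and a careful choice of comparison measures.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical scores for eight subjects (invented numbers, not real data).
new_scale = [12, 15, 9, 20, 17, 11, 14, 18]
related_measure = [30, 36, 25, 47, 41, 28, 33, 44]  # should track the new scale
unrelated_measure = [5, 3, 4, 4, 5, 3, 5, 3]        # should not track it

# Convergent validity: expect a strong correlation.
convergent_r = pearson_r(new_scale, related_measure)
# Discriminant validity: expect a correlation near zero.
discriminant_r = pearson_r(new_scale, unrelated_measure)

print(f"convergent r   = {convergent_r:.2f}")
print(f"discriminant r = {discriminant_r:.2f}")
```

How large is "large" and how small is "small" is a judgment call that depends on your field, so treat any cutoff as an assumption you need to defend.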
Criterion validity is the extent to which your measurement correlates with
a gold standard. You can establish this in two ways. First, you can see if
your measure predicts a future measure that indicates the same thing
(predictive validity). Second, you can see if your measure correlates well
with other concurrent measures which have been established as gold standards
(concurrent validity).
Two large hospitals in the Netherlands studied four risk assessment scales
for pressure ulcers (BMJ 2002; 325: 797 (12 October)). Four separate scales
were applied to patients without pressure ulcers who were expected to stay at
least five days in the hospital. These patients were then followed for up to
12 weeks to see if a pressure ulcer did develop. They showed that all four
scales had poor predictive validity--they were unable to make accurate
predictions of which patients would develop pressure ulcers. The authors
noted that there had been minimal evaluation of these scales beyond expert
opinion and literature review.
Prospective cohort study of routine use of risk assessment scales
for prediction of pressure ulcers. Lisette Schoonhoven, Jeen R E
Haalboom, Mente T Bousema, Ale Algra, Diederick E Grobbee, Maria H Grypdonck,
and Erik Buskens. BMJ 2002; 325: 797.
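One common way to summarize predictive validity for a risk scale like these is to cross-classify the prediction (high risk or not) against the eventual outcome and compute sensitivity, specificity, and predictive values. The sketch below uses invented counts, not the numbers from the BMJ study, and the function name predictive_summary is hypothetical.

```python
def predictive_summary(tp, fp, fn, tn):
    """Summarize a 2x2 table of predicted risk versus observed outcome.

    tp: flagged high risk and developed the condition
    fp: flagged high risk but did not develop it
    fn: not flagged but developed the condition
    tn: not flagged and did not develop it
    """
    if min(tp + fn, fp + tn, tp + fp, fn + tn) == 0:
        raise ValueError("each row and column of the table must be non-empty")
    return {
        "sensitivity": tp / (tp + fn),  # share of true cases that were flagged
        "specificity": tn / (fp + tn),  # share of non-cases that were not flagged
        "ppv": tp / (tp + fp),          # share of flagged patients who became cases
        "npv": tn / (fn + tn),          # share of unflagged patients who stayed well
    }

# Hypothetical counts for 1,000 patients (invented for illustration only).
summary = predictive_summary(tp=30, fp=270, fn=20, tn=680)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

With counts like these, the scale catches a fair share of the true cases, yet most of the patients it flags never develop the condition. That is the kind of pattern that leads to a verdict of poor predictive validity.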
- Asmussen, L., L. M. Olson, et al. (1999). "Reliability and validity of the
Children's Health Survey for Asthma." Pediatrics 104(6): e71.
- Barry, D. (1996). "Differential recall bias and spurious associations in
case/control studies." Statistics in Medicine 15(23): 2603-16.
- Bartko, J. J. (1966). "The intraclass correlation coefficient as a measure
of reliability." Psychological Reports 19(1): 3-11.
- Bartko, J. J. (1976). "On Various Intraclass Correlation Reliability
Coefficients." Psychological Bulletin 83(5): 762-765.
- Bartko, J. J. (1994). "Measures of agreement: a single procedure."
Statistics in Medicine 13(5-7): 737-45.
- Beckett, M., M. Weinstein, et al. (2000). "Do health interview surveys
yield reliable data on chronic illness among older respondents?" American
Journal of Epidemiology 151(3): 315-23.
- Block, G. (1982). "A review of validations of dietary assessment methods."
American Journal of Epidemiology 115(4): 492-505.
- Bulloch, B. and M. Tenenbein (2002). "Validation of 2 pain scales for use
in the pediatric emergency department." Pediatrics 110(3): e33.
- Caan, B., M. Slattery, et al. (1998). "Comparison of the Block and the
Willett self-administered semiquantitative food frequency questionnaires with
an interviewer-administered dietary history." AJE 148(12): 1137-47.
- Carey, R. G. and J. H. Seibert (1993). "A patient survey system to measure
quality improvement: questionnaire reliability and validity." Med Care 31(9):
- Carroll, R. T. The Forer effect (a.k.a. the P.T. Barnum effect and
- Carroll, R. T. The Mozart Effect. The Mozart Effect is a term coined by
Alfred A. Tomatis for the alleged increase in brain development that occurs in
children under age 3 when they listen to the music of Wolfgang Amadeus Mozart.
- Carroll, R. T. Myers-Briggs Type Indicator®. www.skepdic.com/myersb.html
- Collins, S. L., R. A. Moore, et al. (1997). "The visual analogue pain
intensity scale: what is moderate pain in millimetres?" Pain 72(1-2): 95-7.
- Coughlin, S. S. (1990). "Recall bias in epidemiologic studies." J Clin
Epidemiol 43(1): 87-91.
- Cronbach, L. J. (1955). "Construct Validity in Psychological Tests."
Psychological Bulletin 52: 281-302.
- Crume, T. L., C. DiGuiseppi, et al. (2002). "Underascertainment of child
maltreatment fatalities by death certificates, 1990-1998." Pediatrics 110(2 Pt
1): e18 (1 - 6).
- Day, N., N. McKeown, et al. (2001). "Epidemiological assessment of diet: a
comparison of a 7-day diary with a food frequency questionnaire using urinary
markers of nitrogen, potassium and sodium." Int J Epidemiol 30(2): 309-17.
- Driebe, N. M. A Plethora of Threats: A Mildly Amusing Guide for the Weary
Student and Anyone Else Encountering the How To's and What If's of Construct
Validity.
- Ellman, M. S., C. M. Viscoli, et al. (1997). "A new index of prognostic
severity for chronic asthma." Chest 112(3): 582-90.
- Grace, J. Research Fables from the Sisters Grinn, No. 2. Snow White and
the Seven Threats to Validity. www.urmc.rochester.edu/SON/Fables/snowht.htm
- Gray-Donald, K., J. O'Loughlin, et al. (1997). "Validation of a short
telephone administered questionnaire to evaluate dietary interventions in low
income communities in Montreal, Canada." Journal of Epidemiology and Community
Health 51(3): 326-331.
- Hanley, J., A. Capewell, et al. (2001). "Validity study of the severity
index, a simple measure of urinary incontinence in women [In Process
Citation]." British Medical Journal 322(7294): 1096-7.
- Jacobs, J., L. M. Jimenez, et al. (1994). "Treatment of acute childhood
diarrhea with homeopathic medicine: a randomized clinical trial in Nicaragua."
Pediatrics 93(5): 719-25.
- Jacobson, S. W., L. M. Chiodo, et al. (2002). "Validity of Maternal Report
of Prenatal Alcohol, Cocaine, and Smoking in Relation to Neurobehavioral
Outcome." Pediatrics 109(5): 815-825.
- Kipnis, V., D. Midthune, et al. (2001). "Empirical Evidence of Correlated
Biases in Dietary Assessment Instruments and Its Implications." Am. J.
Epidemiol. 153(4): 394-403.
- Labouvie, E., M. E. Bates, et al. (1997). "Age of First Use: Its
Reliability and Predictive Utility." Journal of Studies on Alcohol 58(6):
- Leffondre, K., M. Abrahamowicz, et al. (2002). "Modeling Smoking History:
A Comparison of Different Approaches." Am. J. Epidemiol. 156(9): 813-823.
- Lemaitre, R. N., I. B. King, et al. (1998). "Assessment of trans-fatty
acid intake with a food frequency questionnaire and validation with adipose
tissue levels of trans-fatty acids." Am J Epidemiol 148(11): 1085-93.
- Levine, D. (1994). "True scores, error, reliability, and unit of analysis
in environment and behavior research." Environment and Behavior 26(2): 261-92.
- Levine, D. W., D. F. Kripke, et al. (2003). "Reliability and validity of
the Women's Health Initiative Insomnia Rating Scale." Psychol Assess 15(2):
- Lewin, R. J., D. R. Thompson, et al. (2002). "Validation of the
Cardiovascular Limitations and Symptoms Profile (CLASP) in chronic stable
angina." J Cardiopulm Rehabil 22(3): 184-91.
- Lilienfeld, S. O. (2001). "What's Wrong with This Picture? (Inkblot
Test)." Scientific American: 81 -87.
- Lilienfeld, S. O., J. M. Wood, et al. (2001). "What's Wrong with This
Picture?" Scientific American: 80-87.
- Malviya, S., T. Voepel-Lewis, et al. (2002). "Depth of sedation in
children undergoing computed tomography: validity and reliability of the
University of Michigan Sedation Scale (UMSS)." Br J Anaesth 88(2): 241-5.
- Matheson, D. M., K. A. Hanson, et al. (2002). "Validity of Children's Food
Portion Estimates: A Comparison of 2 Measurement Aids." Arch Pediatr Adolesc
Med 156(9): 867-71.
- Moussa, M. A., M. Z. Shafie, et al. (1990). "Reliability of death
certificate diagnoses." J Clin Epidemiol 43(12): 1285-95.
- Muller, C. (2000). "Rationale, interpretation, validation, and uses of
sperm function tests." Journal of Andrology 21(1): 10-30.
- Nekolaichuk, C. L., E. Bruera, et al. (1999). "A comparison of patient and
proxy symptom assessments in advanced cancer patients." Palliat Med 13(4):
- No authors listed (1993). "The CRIB (clinical risk index for babies)
score: a tool for assessing initial neonatal risk and comparing performance of
neonatal intensive care units. The International Neonatal Network." Lancet
- Penetar, D., U. McCann, et al. (1993). "Caffeine reversal of sleep
deprivation effects on alertness and mood." Psychopharmacology (Berl)
- Richardson, D. K., J. E. Gray, et al. (1993). "Score for Neonatal Acute
Physiology: a physiologic severity index for neonatal intensive care."
Pediatrics 91(3): p617-23.
- Sanders, C., M. Egger, et al. (1998). "Reporting on quality of life in
randomised controlled trials: bibliographic study." BMJ 317(7167): 1191-4.
- Schoonhoven, L., J. R. Haalboom, et al. (2002). "Prospective cohort study
of routine use of risk assessment scales for prediction of pressure ulcers."
BMJ 325(7368): 797.
- Shrout, P. E. and J. L. Fleiss (1979). "Intraclass Correlations: Uses in
Assessing Rater Reliability." Psychological Bulletin 86(2): 420-28.
- Simpson, D. and R. Fincher (1999). "Making a Case for the Teaching
Scholar." Academic Medicine 74(12): 28-31.
- Steele, K. M., K. E. Bass, et al. (1999). "The Mystery of the Mozart
Effect: Failure to Replicate." Psychological Science 10(4): 366-369.
- Sunmola, A. M. (2001). "Developing a scale for measuring the barriers to
condom use in Nigeria." Bull World Health Organ 79(10): p926-32.
- Taylor, J. K. (1983). "Validation of Analytical Methods." Analytical
Chemistry.
- The Royal Windsor Society for Nursing Research. Instrument Validity.
- Thomas, E., D. Studdert, et al. (2002). "The reliability of medical record
review for estimating adverse event rates." Ann Intern Med 136(11): 812-816.
- Thompson, F., H. Metzner, et al. (1990). "Characteristics of individuals
and long term reproducibility of dietary reports: the Tecumseh Diet
Methodology Study." J Clin Epidemiol 43(11): 1169-78.
- Trochim, W. M. K. Types of Reliability. trochim.human.cornell.edu/kb/reltypes.htm
Excerpt: You learned in the Theory of Reliability that it's not possible to
calculate reliability exactly. Instead, we have to estimate reliability, and
this is always an imperfect endeavor. Here, I want to introduce the major
reliability estimators and talk about their strengths and weaknesses.
- Walsh, D. A. and D. A. Gentile (2001). "A validity test of movie,
television, and video-game ratings." Pediatrics 107(6): p1302-8.
- Wells, A., P. English, et al. (1998). "Misclassification rates for current
smokers misclassified as nonsmokers." American Journal of Public Health
- Werler, M. M., B. R. Pober, et al. (1989). "Reporting accuracy among
mothers of malformed and nonmalformed infants." Am J Epidemiol 129(2):
- Wirfält, A., R. Jeffery, et al. (1998). "Comparison of food frequency
questionnaires: the reduced Block and Willett questionnaires differ in ranking
on nutrient intakes." AJE 148(12): 1148-56.
- Wright, S. P. (1999). "Reporting on quality of life in RCTs." British
Medical Journal 318(7191): 1142.
- Instrument Validity. The Royal Windsor Society for Nursing
Research. Accessed on November 4, 2002.
- The Examination Chapter of the Neurology section of the Family Practice
has some interesting tests like the mini-mental state exam that could serve as
good examples of developing validity.
- The Cochrane Group defines internal validity as "the extent to which the
observed effects are true for the people in a study" --
- J Psychosom Res 2002 Feb;52(2):69-77. The validity of the Hospital Anxiety
and Depression Scale. An updated literature review. Bjelland I, Dahl AA, Haug
TT, Neckelmann D.
- J Cardiopulm Rehabil 2002 May-Jun;22(3):184-91. Validation of the
Cardiovascular Limitations and Symptoms Profile (CLASP) in chronic stable
angina. Lewin RJ, Thompson DR, Martin CR, Stuckey N, Devlen J, Michaelson S,
Maguire P.
- Psychosomatics 2001 Sep-Oct;42(5):423-8. Sensitivity and specificity
of observer and self-report questionnaires in major and minor depression
following myocardial infarction. Strik JJ, Honig A, Lousberg R, Denollet J.
- Disabil Rehabil 2001 Nov 10;23(16):737-44. Screening for anxiety,
depressive and somatoform disorders in rehabilitation--validity of HADS and
GHQ-12 in patients with musculoskeletal disease. Harter M, Reuter K, Gross-Hardt
K, Bengel J.
This page was written by Steve Simon while working at Children's Mercy Hospital. Although I do not hold the copyright for this material, I am reproducing it here as a service, as it is no longer available on the Children's Mercy Hospital website.