Stats: What is construct validity? (March 8, 2006)

Someone asked me to define face validity, criterion validity, and construct validity. That's a tall order. In general, validity means that a measurement that we take represents what we think it should. This is important to establish, because many times we think we are measuring one thing, but we are measuring something else entirely. It is important to remember that validity is a journey and not a goal. You never reach a place called the land of valid measurements. Instead, you gradually strengthen the evidence for validity, but there is no threshold that you cross where you can say, "We can now conclude that the measure is valid." Similarly, there is no region we can point to where we can say with confidence "We have not yet reached the point where we can say that the measure is valid."

Let me tackle the last definition first. Construct validity is the degree to which a direct measurement represents an unobserved social construct. To establish construct validity, you demonstrate that the measure changes in a logical way when other conditions change. For example, in the study described below, faculty viewed videotapes of "standardized residents" who depicted either unsatisfactory, marginal/satisfactory, or high satisfactory/superior performance. The ratings given on the miniCEX tracked the scripted performance levels, which supports the construct validity of the miniCEX.

Construct Validity of the MiniClinical Evaluation Exercise (MiniCEX). E. S. Holmboe, S. Huot, J. Chung, J. Norcini, R. E. Hawkins. Acad Med 2003: 78(8); 826-830. [Medline] [Abstract] [Full text] [PDF] PURPOSE: To investigate the construct validity of the miniclinical evaluation exercise (miniCEX). METHOD: Forty faculty participants from 16 internal medicine residency programs enrolled in a randomized, controlled trial of faculty development. Using a standard nine-point miniCEX rating form, participants watched and rated performances of standardized residents on nine scripted clinical videotapes depicting three levels of performance (unsatisfactory, marginal/satisfactory, and high satisfactory/superior). The nine-point rating scale was 1-3 = unsatisfactory, 4-6 = marginal/satisfactory, and 7-9 = superior. The performances were rated for three clinical skills, history taking, physical examination, and counseling. RESULTS: For each of the three clinical skills, the faculty participants were able to successfully discriminate among the three levels of performance using the miniCEX scale. Differences among ratings of the three performance levels were statistically significant; however, the range in ratings among the participants for each videotape was wide. CONCLUSION: The authors believe this to be the first study to document the construct validity of the miniCEX. Although the miniCEX appears to have reliability and construct validity, further research is needed to improve individual faculty observation skills and reduce interrater variability.
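
The known-groups logic of this study can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the ratings below are invented, not data from Holmboe et al., and the one-way ANOVA F statistic is my choice of summary, not necessarily the analysis the authors used. The names `ratings`, `one_way_anova_f`, `group_means`, and `f_stat` are hypothetical.

```python
from statistics import mean

# Hypothetical ratings on a nine-point scale, grouped by the scripted
# performance level of the videotape. These numbers are invented for
# illustration and are NOT taken from the study.
ratings = {
    "unsatisfactory": [2, 3, 4, 3, 2, 5],
    "marginal/satisfactory": [5, 4, 6, 5, 7, 5],
    "high satisfactory/superior": [8, 7, 9, 6, 8, 7],
}


def one_way_anova_f(groups):
    """Return the one-way ANOVA F statistic for a list of groups."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    k = len(groups)          # number of groups
    n = len(all_values)      # total number of ratings
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of individual ratings around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


group_means = {level: mean(values) for level, values in ratings.items()}
f_stat = one_way_anova_f(list(ratings.values()))

for level, m in group_means.items():
    print(f"{level}: mean rating {m:.2f}")
print(f"F statistic: {f_stat:.2f}")
```

If the group means rise with the scripted performance level and the F statistic is large relative to its reference distribution, that is one more piece of evidence for construct validity. A wide spread of ratings within each group, as the abstract above reports, shows up as a large within-group sum of squares.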

Further reading

Examination of instruments used to rate quality of health information on the internet: chronicle of a voyage with an unclear destination. A. Gagliardi, A. R. Jadad. Bmj 2002: 324(7337); 569-73. [Medline] [Abstract] [Full text] [PDF] OBJECTIVE: This study updates work published in 1998, which found that of 47 rating instruments appearing on websites offering health information, 14 described how they were developed, five provided instructions for use, and none reported the interobserver reliability and construct validity of the measurements. DESIGN: All rating instrument sites noted in the original study were visited to ascertain whether they were still operating. New rating instruments were identified by duplicating and enhancing the comprehensive search of the internet and the medical and information science literature used in the previous study. Eligible instruments were evaluated as in the original study. RESULTS: 98 instruments used to assess the quality of websites in the past five years were identified. Many of the rating instruments identified in the original study were no longer available. Of 51 newly identified rating instruments, only five provided some information by which they could be evaluated. As with the six sites identified in the original study that remained available, none of these five instruments seemed to have been validated. CONCLUSIONS: Many incompletely developed rating instruments continue to appear on websites providing health information, even when the organisations that gave rise to those instruments no longer exist. Many researchers, organisations, and website developers are exploring alternative ways of helping people to find and use high quality information available on the internet. Whether they are needed or sustainable and whether they make a difference remain to be shown.

Rating health information on the Internet: navigating to knowledge or to Babel? A. R. Jadad, A. Gagliardi. Jama 1998: 279(8); 611-4. [Medline] [Abstract] [Full text] [PDF] (Evidence, Mountain, Measurement Quality, Reliability/Validity) CONTEXT: The rapid growth of the Internet has triggered an information revolution of unprecedented magnitude. Despite its obvious benefits, the increase in the availability of information could also result in many potentially harmful effects on both consumers and health professionals who do not use it appropriately. OBJECTIVES: To identify instruments used to rate Web sites providing health information on the Internet, rate criteria used by them, establish the degree of validation of the instruments, and provide future directions for research in this area. DATA SOURCES: MEDLINE (1966-1997), CINHAL (1982-1997), HEALTH (1975-1997), Information Science Abstracts (1966 to September 1995), Library and Information Science Abstracts (1969-1995), and Library Literature (1984-1996); the search engines Lycos, Excite, Open Text, Yahoo, HotBot, Infoseek, and Magellan; Internet discussion lists; meeting proceedings; multiple Web pages; and reference lists. INSTRUMENT SELECTION: Instruments used at least once to rate the quality of Web sites providing health information with their rating criteria available on the Internet. DATA EXTRACTION: The name of the developing organization, Internet address, rating criteria, information on the development of the instrument, number and background of people generating the assessments, and data on the validity and reliability of the measurements. DATA SYNTHESIS: A total of 47 rating instruments were identified. Fourteen provided a description of the criteria used to produce the ratings, and 5 of these provided instructions for their use. None of the instruments identified provided information on the interobserver reliability and construct validity of the measurements. 
CONCLUSIONS: Many incompletely developed instruments to evaluate health information exist on the Internet. It is unclear, however, whether they should exist in the first place, whether they measure what they claim to measure, or whether they lead to more good than harm.

Evidence for the Factorial and Construct Validity of a Self-Report Concussion Symptoms Scale. S. G. Piland, R. W. Motl, M. S. Ferrara, C. L. Peterson. J Athl Train 2003: 38(2); 104-112. [Medline] [Abstract] [Full text] [PDF] OBJECTIVE: To evaluate the factorial and construct validity of the Head Injury Scale (HIS) among a sample of male and female collegiate athletes. DESIGN AND SETTING: Using a cross-sectional design, we established the factorial validity of the HIS scale with confirmatory factor analysis and the construct validity of the HIS with Pearson product moment correlation analyses. Using an experimental design, we compared scores on the HIS between concussed and nonconcussed groups with a 2 (groups) x 5 (time) mixed-model analysis of variance. SUBJECTS: Participants (N = 279) in the cross-sectional analyses were predominately male (n = 223) collegiate athletes with a mean age of 19.49 +/- 1.63 years. Participants (N = 33) in the experimental analyses were concussed (n = 17) and nonconcussed control (n = 16) collegiate athletes with a mean age of 19.76 +/- 1.49 years. MEASUREMENTS: All participants completed baseline measures for the 16-item HIS, neuropsychological testing battery, and posturography. Concussed individuals and paired controls were evaluated on days 1, 2, 3, and 10 postinjury on the same testing battery. RESULTS: Confirmatory factor analysis indicated that a theoretically derived, 3-factor model provided a good but not excellent fit to the 16-item HIS. Hence, the 16-item HIS was modified on the basis of substantive arguments about item-content validity. The subsequent analysis indicated that the 3-factor model provided an excellent fit to the modified 9-item HIS. The 3 factors were best described by a single second-order factor: concussion symptoms. Scores from the 16-item HIS and 9-item HIS were strongly correlated, but there were few significant correlations between HIS scores and scores from the neuropsychological and balance measures. 
A significant group-by-day interaction was noted on both the 9-item HIS and 16-item HIS, with significant differences seen between groups on days 1 and 2 postconcussion. CONCLUSIONS: We provide evidence for the factorial and construct validity of the HIS among collegiate athletes. This scale might aid in return-to-play decisions by physicians and athletic trainers.

A systematic review of the content of critical appraisal tools. P. Katrak, A. E. Bialocerkowski, N. Massy-Westropp, S. Kumar, K. A. Grimmer. BMC Med Res Methodol 2004: 4(1); 22. [Medline] [Abstract] [Full text] [PDF] BACKGROUND: Consumers of research (researchers, administrators, educators and clinicians) frequently use standard critical appraisal tools to evaluate the quality of published research reports. However, there is no consensus regarding the most appropriate critical appraisal tool for allied health research. We summarized the content, intent, construction and psychometric properties of published, currently available critical appraisal tools to identify common elements and their relevance to allied health research. METHODS: A systematic review was undertaken of 121 published critical appraisal tools sourced from 108 papers located on electronic databases and the Internet. The tools were classified according to the study design for which they were intended. Their items were then classified into one of 12 criteria based on their intent. Commonly occurring items were identified. The empirical basis for construction of the tool, the method by which overall quality of the study was established, the psychometric properties of the critical appraisal tools and whether guidelines were provided for their use were also recorded. RESULTS: Eighty-seven percent of critical appraisal tools were specific to a research design, with most tools having been developed for experimental studies. There was considerable variability in items contained in the critical appraisal tools. Twelve percent of available tools were developed using specified empirical research. Forty-nine percent of the critical appraisal tools summarized the quality appraisal into a numeric summary score. Few critical appraisal tools had documented evidence of validity of their items, or reliability of use. Guidelines regarding administration of the tools were provided in 43% of cases. 
CONCLUSIONS: There was considerable variability in intent, components, construction and psychometric properties of published critical appraisal tools for research reports. There is no "gold standard" critical appraisal tool for any study design, nor is there any widely accepted generic tool that can be applied equally well across study types. No tool was specific to allied health research requirements. Thus interpretation of critical appraisal of research reports currently needs to be considered in light of the properties and intent of the critical appraisal tool chosen for the task.

The measurement and monitoring of surgical adverse events [PDF]. J. Bruce, E. M. Russell, J. Mollison, Z. H. Krukowski. Accessed on 2003-08-15. [Excerpt] BACKGROUND: Surgical adverse events contribute significantly to postoperative morbidity, yet the measurement and monitoring of events is often imprecise and of uncertain validity. Given the trend of decreasing length of hospital stay and the increase in use of innovative surgical techniques--particularly minimally invasive and endoscopic procedures--accurate measurement and monitoring of adverse events is crucial. OBJECTIVES: The aim of this methodological review was to identify a selection of common and potentially avoidable surgical adverse events and to assess whether they could be reliably and validly measured, to review methods for monitoring their occurrence and to identify examples of effective monitoring systems for selected events. This review is a comprehensive attempt to examine the quality of the definition, measurement, reporting and monitoring of selected events that are known to cause significant postoperative morbidity and mortality. METHODS - SELECTION OF SURGICAL ADVERSE EVENTS: Four adverse events were selected on the basis of their frequency of occurrence and likelihood of evidence of measurement and monitoring: (1) surgical wound infection; (2) anastomotic leak; (3) deep vein thrombosis (DVT); (4) surgical mortality. Surgical wound infection and DVT are common events that cause significant postoperative morbidity. Anastomotic leak is a less common event, but risk of fatality is associated with delay in recognition, detection and investigation. Surgical mortality was selected because of the effort known to have been invested in developing systems for monitoring surgical death, both in the UK and internationally. Systems for monitoring surgical wound infection were also included in the review. 
METHODS - LITERATURE SEARCH: Thirty separate, systematic literature searches of core health and biomedical bibliographic databases (MEDLINE, EMBASE, CINAHL, HealthSTAR and the Cochrane Library) were conducted. The reference lists of retrieved articles were reviewed to locate additional articles. A matrix was developed whereby different literature and study designs were reviewed for each of the surgical adverse events. Each article eligible for inclusion was independently reviewed by two assessors. METHODS - CRITICAL APPRAISAL: Studies were appraised according to predetermined assessment criteria. Definitions and grading scales were assessed for: content, criterion and construct validity; repeatability; reproducibility; and practicality (surgical wound infection and anastomotic leak). Monitoring systems for surgical wound infection and surgical mortality were assessed on the following criteria: (1) coverage of the system; (2) whether or not denominator data were collected; (3) whether standard and agreed definitions were used; (4) inclusion of risk adjustment; (5) issues related to data collection; (6) postdischarge surveillance; (7) output in terms of feedback and wider dissemination. RESULTS - SURGICAL WOUND INFECTION: A total of 41 different definitions and 13 grading scales of surgical wound infection were identified from 82 studies. Definitions of surgical wound infection varied from presence of pus to complex definitions such as those proposed by the Centres for Disease Control in the USA. A small body of literature has been published on the content, criterion and construct validity of different definitions, and comparisons have been made against wound assessment scales and multidimensional indices. There are examples of comprehensive hospital-based monitoring systems of surgical wound infection, mainly under the auspices of nosocomial surveillance. 
To date, however, there is little evidence of systematic measurement and monitoring of surgical wound infection after hospital discharge. RESULTS - ANASTOMOTIC LEAK: Over 40 definitions of anastomotic leak were extracted from 107 studies of upper gastrointestinal, hepatopancreaticobiliary and lower gastrointestinal surgery. No formal evaluations were found that assessed the validity or reliability of definitions or severity scales of anastomotic leak. One definition was proposed during a national consensus workshop, but no evidence of its use was found in the surgical literature. The lack of a single definition or gold standard hampers comparison of postoperative anastomotic leak rates between studies and institutions. RESULTS - DEEP VEIN THROMBOSIS: Although a critical review of the DVT literature could not be completed within the realms of this review, it was evident that a number of new techniques for the detection and diagnosis of DVT have emerged in the last 20 years. The group recommends a separate review be undertaken of the different diagnostic tests to detect DVT. RESULTS - SURGICAL MORTALITY MONITORING SYSTEMS: The definition of surgical mortality is relatively consistent between monitoring systems, but duration of follow-up of death postdischarge varies considerably. The majority of systems report in-hospital mortality rates; only some have the potential to link deaths to national death registers. Risk assessment is an important factor and there should be a distinction between recording pre-intervention factors and postoperative complications. A variety of risk scoring systems was identified in the review. Factors associated with accurate and complete data collection include the employment of local, dedicated personnel, simple and structured prompts to ensure that clinical input is complete, and accurate and automated data capture and transfer. 
CONCLUSIONS: The use of standardised, valid and reliable definitions is fundamental to the accurate measurement and monitoring of surgical adverse events. This review found inconsistency in the quality of reporting of postoperative adverse events, limiting accurate comparison of rates over time and between institutions. The duration of follow-up for individual events will vary according to their natural history and epidemiology. Although risk-adjusted aggregated rates can act as screening or warning systems for adverse events, attribution of whether events are avoidable or preventable will invariably require further investigation at the level of the individual, unit or department. CONCLUSIONS - RECOMMENDATIONS FOR RESEARCH: (1) A single, standard definition of surgical wound infection is needed so that comparisons over time and between departments and institutions are valid, accurate and useful. Surgeons and other healthcare professionals should consider adopting the 1992 Centers for Disease Control (CDC) definition for superficial incisional, deep incisional and organ/space surgical site infection for hospital monitoring programmes and surgical audits. There is a need for further methodological research into the performance of the CDC definition in the UK setting. (2) There is a need to formally assess the reliability of self-diagnosis of surgical wound infection by patients. (3) There is a need to assess formally the reliability of case ascertainment by infection control staff. (4) Work is needed to create and agree a standard, valid and reliable definition of anastomotic leak which is acceptable to surgeons. (5) A systematic review is needed of the different diagnostic tests for the diagnosis of DVT. 
(6) The following variables should be considered in any future DVT review: anatomical region (lower limb, upper limb, pelvis); patient presentation (symptomatic, asymptomatic); outcome of diagnostic test (successfully completed, inconclusive, technically inadequate, negative); length of follow-up; cost of test; whether or not serial screening was conducted; and recording of laboratory cut-off values for fibrinogen equivalent units. (7) A critical review is needed of the surgical risk scoring used in monitoring systems. (8) In the absence of automated linkage there is a need to explore the benefits and costs of monitoring in primary care. (9) The growing potential for automated linkage of data from different sources (including primary care, the private sector and death registers) needs to be explored as a means of improving the ascertainment of surgical complications, including death. This linkage needs to be within the terms of data protection, privacy and human rights legislation. (10) A review is needed of the extent of the use and efficiency of routine hospital data versus special collections or voluntary reporting. www.ncchta.org/fullmono/mon522.pdf

Validation of an Index of the Quality of Review Articles. Andrew D. Oxman, Gordon H. Guyatt. Journal of Clinical Epidemiology 1991: 44(11); 1271-1278. ABSTRACT: The objective of this study was to assess the validity of an index of the scientific quality of research overviews, the Overview Quality Assessment Questionnaire (OQAQ). Thirty-six published review articles were assessed by 9 judges using the OQAQ. Authors' reports of what they had done were compared to OQAQ ratings. The sensibility of the OQAQ was assessed using a 13 item questionnaire. Seven a priori hypotheses were used to assess construct validity. The review articles were drawn from three sampling frames: articles highly rated by criteria external to the study, meta-analyses, and a broad spectrum of medical journals. Three categories of judges were used to assess the articles: research assistants, clinicians with research training and experts in research methodology, with 3 judges in each category. The sensibility of the index was assessed by 15 randomly selected faculty members of the Department of Clinical Epidemiology and Biostatistics at McMaster. Authors' reports of their methods related closely to ratings from corresponding OQAQ items: for each criterion, the mean score was significantly higher for articles for which the authors' responses indicated that they had used more rigorous methods. For 10 of the 13 questions used to assess sensibility the mean rating was 5 or greater, indicating general satisfaction with the instrument. The primary shortcoming noted was the need for judgement in applying the index. Six of the 7 hypotheses used to test construct validity held true. The OQAQ is a valid measure of the quality of research overviews.

Construct Validity in Psychological Tests. Lee J. Cronbach. Psychological Bulletin 1955: 52; 281-302. [Excerpt] Validation of psychological tests has not yet been adequately conceptualized, as the APA Committee on Psychological Tests learned when it undertook (1950-54) to specify what qualities should be investigated before a test is published. In order to make coherent recommendations the Committee found it necessary to distinguish four types of validity, established by different types of research and requiring different interpretation. The chief innovation in the Committee's report was the term construct validity.[2] This idea was first formulated by a subcommittee (Meehl and R. C. Challman) studying how proposed recommendations would apply to projective techniques, and later modified and clarified by the entire Committee (Bordin, Challman, Conrad, Humphreys, Super, and the present writers). The statements agreed upon by the Committee (and by committees of two other associations) were published in the Technical Recommendations (59). The present interpretation of construct validity is not "official" and deals with some areas where the Committee would probably not be unanimous. The present writers are solely responsible for this attempt to explain the concept and elaborate its implications.

Depth of sedation in children undergoing computed tomography: validity and reliability of the University of Michigan Sedation Scale (UMSS). S. Malviya, T. Voepel-Lewis, A. R. Tait, S. Merkel, K. Tremper, N. Naughton. Br J Anaesth 2002: 88(2); 241-5. [Medline] BACKGROUND: Safe care of sedated children requires ongoing assessment of the depth of sedation to permit early recognition of progression to over-sedation. This study evaluated the validity and reliability of the University of Michigan Sedation Scale (UMSS) as a measure of sedation during procedures. The UMSS is a simple observational tool that assesses the level of alertness on a five-point scale ranging from 1 (wide awake) to 5 (unarousable with deep stimulation). METHODS: Thirty-two children aged 4 months to 5 yr (mean 1.5 yr), sedated for computed tomography (CT), were studied prospectively. The CT nurse assessed sedation using the UMSS before sedative administration and every 10 min thereafter. The child was videotaped during each assessment, and segments were edited and their order was randomized. Four nurses blinded to sedative administration viewed the segments and scored sedation using the UMSS. One of these nurses also scored sedation using a visual analogue scale (VAS) and another using the Observer's Assessment of Alertness/Sedation Scale (OAAS). To examine the test-retest reliability, 75 randomly selected video segments were viewed and scored on a second occasion. RESULTS: Changes in scores from baseline to discharge supported construct validity (P<0.0001). Criterion validity was demonstrated by significant correlations between the UMSS and the VAS and OAAS. There was good interobserver agreement between blinded observers' scores for each level of sedation and at discharge, and between blinded observers and the CT nurse for scores of 0 and 1 (lighter levels of sedation), but less agreement for scores 2 and 3 (deeper sedation) and discharge scores. 
Test-retest reliability was supported by agreement in the observers' UMSS scores. CONCLUSION: The UMSS is a simple, valid and reliable tool that facilitates rapid and frequent assessment and documentation of depth of sedation in children.

A patient survey system to measure quality improvement: questionnaire reliability and validity. R. G. Carey, J. H. Seibert. Med Care 1993: 31(9); 834-45. This study describes the results of a four-year research effort to develop inpatient and outpatient questionnaires that have sufficient validity and reliability to be used to measure patient perceptions of quality. As part of this effort, over 50,000 inpatients, emergency room patients, and ambulatory surgery patients from over 300 hospitals representing every US census region were surveyed. Separate questionnaires, called Quality of Care Monitors, were developed for inpatients and outpatients. The inpatient questionnaire consisted of 8 scales: Physician Care, Nursing Care, Medical Outcome, Courtesy, Food Service, Comfort and Cleanliness, Admissions/Billing, and Religious Care. The outpatient questionnaire had 7 scales: Physician Care, Nursing Care, Medical Outcome, Facility Characteristics, Waiting Time, Testing Services and Registration Process. The study found strong evidence of construct validity, predictive validity, and internal consistency for both questionnaires. Each questionnaire is capable of measuring separate dimensions of patient experience. A data bank developed from these questionnaires is currently accessed regularly by participating hospitals to assess quality improvement and to make benchmark comparisons with similar hospitals.

A Plethora of Threats: A Mildly Amusing Guide for the Weary Student and Anyone Else Encountering the How To's and What If's of Construct Validity. Nicole M. Driebe. Accessed on 2003-09-17. [Excerpt] Warning: This web page may cause severe gastrointestinal disorders, bloodshot eyes and various other stress-related pains -- particularly for those who are just about to engage in their thesis research (and thought they had thought of everything!). Anyone planning on finishing graduate school in less than 10 years should consult Dr. Daniels (Jack, of course) before reading further. Also note: The events and characters portrayed here are purely fictional. If anyone or any situation resemble you or your own situation in any way -- join the club. trochim.human.cornell.edu/tutorial/driebe/tweb1.htm

Reliability and validity of the Children's Health Survey for Asthma. L. Asmussen, L. M. Olson, E. N. Grant, J. Fagan, K. B. Weiss. Pediatrics 1999: 104(6); e71. [Medline] OBJECTIVE: Describe the psychometric properties of the Children's Health Survey for Asthma (CHSA), a condition-specific, self-report, functional health measure for parents of children 5 to 12 years of age with chronic asthma. METHOD: Data from two cross-sectional and one longitudinal study were used to assess internal consistency reliability, test-retest reliability, and validity of the CHSA. Over 275 parents and guardians of children with asthma completed the CHSA in one of three studies. The combined samples included a heterogenous mix of respondents by child age and race/ethnicity and parental marital and socioeconomic status. Five domain scores were computed: physical health, activity (child), activity (family), emotional health (child), and emotional health (family). Raw scale scores were transformed from 0 to 100 with higher scores indicating better or more positive outcomes. RESULTS: Across the three samples, mean scale scores ranged from a low of 61.5 (emotional health of the child) to a high of 86.1 (activity [family]). Internal consistency reliability for each of the scales was high (Cronbach's alpha = .81-.92), and test-retest reliability (correlation between forms) ranged from .62 to .86. Significant differences in mean scores for four of five scales were noted between those with low versus moderate to high recent symptom activity. CONCLUSION: In three tests, the CHSA displays strong reliability and validity. Descriptive statistics demonstrate a range of scale scores. Internal consistency is good to excellent and short-term test-retest reliability is good for each of the five scales. Construct validity is demonstrated by the ability of CHSA to distinguish levels of disease severity, defined by symptom activity.

Reliability and validity of the Women's Health Initiative Insomnia Rating Scale. D. W. Levine, D. F. Kripke, R. M. Kaplan, M. A. Lewis, M. J. Naughton, D. J. Bowen, S. A. Shumaker. Psychol Assess 2003: 15(2); 137-48. [Medline] Reliability and construct validity of the 5-item Women's Health Initiative Insomnia Rating Scale (WHIIRS) were evaluated in 2 studies. In Study 1, using a sample of 66,269 postmenopausal women, validity of the WHIIRS was assessed by examining its relationship to other measures known to be related to sleep quality. Reliability of the WHIIRS was estimated using a resampling approach; the mean alpha coefficient was .78. Test-retest reliability coefficients were .96 for same-day administration and .66 after a year or more. Correlations of the WHIIRS with the other measures were in the predicted directions. Study 2 used a sample of 459 women and compared the WHIIRS with objective indicators of sleep quality. Results showed that differences in the objective indicators could be detected by the WHIIRS. Findings suggest that a between-group mean difference of approximately 0.50 of a standard deviation on the WHIIRS may be clinically meaningful.

Validation of 2 pain scales for use in the pediatric emergency department. B. Bulloch, M. Tenenbein. Pediatrics 2002: 110(3); e33. [Medline] OBJECTIVE: To determine the construct, content, and convergent validity of 2 self-report pain scales for use in the untrained child in the emergency department (ED). METHODS: A prospective study was conducted of all children who presented to an urban ED between 5 and 16 years of age inclusive after written informed consent was obtained. Children were excluded if they were intoxicated, had altered sensorium, were clinically unstable, did not speak English, or had developmental delays. Children marked their current pain severity on a standardized Color Analog Scale (CAS) and a 7-point Faces Pain Scale (FPS). They were then asked whether their pain was mild, moderate, or severe. Children were then administered an analgesic at the discretion of the attending physician and asked to repeat these measurements. For assessing content validity, the scales were also administered to age- and gender-matched children in the ED for nonpainful conditions. Convergent validity was assessed by determining the Spearman correlation coefficient between the 2 pain scales. RESULTS: A total of 60 children were enrolled, 30 with pain and 30 without, with a mean age of 9.3 +/- 3.3 years. Boys accounted for 38 of the enrollees (63.3%). The median score before analgesic administration was 6.0 cm (interquartile range [IQR]: 4.0-8.0) on the CAS and 3.0 faces (IQR: 2.0-5.0) on the FPS; after analgesic administration, the median scores decreased to 3.1 cm (IQR: 1.1-4.3) and 2.0 faces (IQR: 1.0-3.0), respectively. As the reported pain intensity increased, so did the scores on the 2 pain scales. The 30 children with no pain had a median score on the CAS of 0.0 (IQR: 0.0-1.0) and on the FPS of 0.0 (IQR: 0.0-1.0), whereas the 13 children with severe pain had a median CAS of 7.0 (IQR: 6.0-8.0) and a median FPS of 5.0 (IQR: 4.0-6.0). 
The Spearman correlation coefficient between the CAS and the FPS was positive and strong (r = 0.894). CONCLUSION: The CAS and the FPS exhibit construct, content, and convergent validity in the measurement of acute pain in children in the ED.

Cross-validation of a composite pain scale for preschool children within 24 hours of surgery. S. Suraseranivongse, U. Santawat, K. Kraiprasit, S. Petcharatana, S. Prakkamodom, N. Muntraporn. Br J Anaesth 2001: 87(3); 400-5. [Medline] [Abstract] [Full text] [PDF] This study was designed to cross-validate a composite measure of the pain scales CHEOPS (Children's Hospital of Eastern Ontario Pain Scale), OPS (Objective Pain Scale, simplified for parent use by replacing blood pressure measurement with observation of body language or posture), TPPPS (Toddler Preschool Postoperative Pain Scale) and FLACC (Face, Legs, Activity, Cry, Consolability) in 167 Thai children aged 1-5.5 yr. The pain scales were translated and tested for content, construct and concurrent validity, including inter-rater and intra-rater reliabilities. Discriminative validity in immediate and persistent pain for the age groups < or =3 and >3 yr were also studied. The children's behaviour was videotaped before and after surgery, before analgesia had been given in the post-anaesthesia care unit (PACU), and on the ward. Four observers then rated pain behaviour from rearranged videotapes. The decision to treat pain was based on routine practice and was made by a researcher unaware of the rating procedure. All tools had acceptable content validity and excellent inter-rater and intra-rater reliabilities (intraclass correlation >0.9 and >0.8 respectively). Construct validity was determined by the ability to differentiate the group with no pain before surgery and a high pain level after surgery, before analgesia (P<0.001). The positive correlations among all scales in the PACU and on the ward (r=0.621-0.827, P<0.0001) supported concurrent validity. Use of the kappa statistic indicated that CHEOPS yielded the best agreement with the routine decision to treat pain. The younger and older age groups both yielded very good agreement in the PACU but only moderate agreement on the ward. 
On the basis of data from this study, we recommend CHEOPS as a valid, reliable and practical tool.
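
Several of the abstracts above report internal consistency as Cronbach's alpha alongside their validity evidence. For readers who have not computed it, here is a minimal sketch using only the Python standard library. The response matrix is invented for illustration and does not come from any of the studies cited here, and the names `responses` and `cronbach_alpha` are hypothetical. The formula used is the usual one: alpha = k/(k-1) times (1 minus the sum of the item variances divided by the variance of the total score), where k is the number of items.

```python
from statistics import variance

# Hypothetical questionnaire data: each row is one respondent, each
# column is one item on the same scale. Invented for illustration only.
responses = [
    [4, 5, 4, 5],
    [2, 3, 2, 2],
    [5, 5, 4, 4],
    [1, 2, 1, 2],
    [3, 3, 4, 3],
    [4, 4, 5, 5],
]


def cronbach_alpha(rows):
    """Return Cronbach's alpha for a respondents-by-items table."""
    k = len(rows[0])                       # number of items
    items = list(zip(*rows))               # transpose to item columns
    # Sum of the sample variances of the individual items.
    item_variance_sum = sum(variance(col) for col in items)
    # Sample variance of each respondent's total score.
    total_variance = variance([sum(row) for row in rows])
    return (k / (k - 1)) * (1 - item_variance_sum / total_variance)


alpha = cronbach_alpha(responses)
print(f"Cronbach's alpha: {alpha:.2f}")
```

Keep in mind that alpha speaks to reliability, not validity. A scale can have a high alpha and still measure something other than what you intended, which is why the studies above report both kinds of evidence.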

This page was written by Steve Simon while working at Children's Mercy Hospital. Although I do not hold the copyright for this material, I am reproducing it here as a service, as it is no longer available on the Children's Mercy Hospital website.