StATS: Reliability/Validity (January 13, 2004)

Dear Professor Mean, How do I show that a measurement has validity and reliability? Meek Mark

Dear Meek,

Just list Professor Mean as one of your co-authors. My reputation will establish validity and reliability all by itself.

Short Answer

The terms validity and reliability have different meanings to different people, so be sure you understand exactly what is needed for your particular discipline. As a general rule, you establish reliability by showing a good level of correlation, concordance, or agreement between multiple assessments of the same measurement. For example, you could ask two people to rate the same individual patient and then compare the results. Or, if the condition is stable over time, you might take your measurement at two different times. You establish validity by comparing your measurement to an external standard. The hard part is coming up with an appropriate standard.

Estimating reliability from repeated measurements by the same observer.

The simplest way to assess reliability is to ask an observer to measure the same patient twice. Common sense tells you to be careful here. Don't schedule the two measurements in such a way that the first measurement would influence the second measurement. Make sure the measurements are far enough apart in time that the observer is unlikely to remember at the time of the second measurement what the first measurement was. On the other hand, don't schedule the measurements so far apart that the patient has had enough time to substantially improve or worsen on the condition you are trying to measure.

The estimate of reliability in this situation uses a one factor random effects model. This model works well when the repeated measurements are more or less exchangeable.
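For readers who like to see the model written out, here is one common way to state the one factor random effects model. The symbols are my own notation, not something taken from the SPSS output: y_ij is measurement j on patient i, s_i is the random patient effect, and e_ij is the measurement error.

```latex
y_{ij} = \mu + s_i + e_{ij}, \qquad
s_i \sim N(0, \sigma_s^2), \quad
e_{ij} \sim N(0, \sigma_e^2), \quad
s_i \text{ and } e_{ij} \text{ independent}

\operatorname{Var}(y_{ij}) = \sigma_s^2 + \sigma_e^2, \qquad
\operatorname{Cov}(y_{ij}, y_{ij'}) = \sigma_s^2 \quad (j \neq j')
```

Every pair of measurements on the same patient has the same covariance, no matter which measurement came first. That is what it means for the repeated measurements to be exchangeable.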

Generally, you will find it easiest to get two measurements on the same patient, but if circumstances allow for a third or fourth measurement, the one factor random effects model allows you to use those data. You can also have a different number of measurements on different patients without any problem. The formulas get a bit more complex, but statistical software like SPSS does all the work for you.

Here's some data from a chiropractic assessment of 19 individual patients. The assessment was the distance along the spine, in millimeters, of a particular condition. The same chiropractor made this measurement twice on each patient. Here's a listing of the data (turned sideways to save space).

Here's a simple graph of the data.

Notice that some measurements are very consistent (patient 11 has measurements that are only 6 mm apart) and that other measurements are not so consistent (patient 16 has measurements that are 120 mm apart, and patient 17 has measurements that are 103 mm apart).

To quantify the degree of reliability of these measurements, select ANALYZE | GENERAL LINEAR MODEL | VARIANCE COMPONENTS from the SPSS menu.

The measurement itself is the dependent variable and the code for each individual patient is the random factor in this model. There are several options for computing variance components.

The ANOVA option is the easiest method, but can sometimes lead to negative variance estimates, which are a bit difficult to interpret. In practice, a negative estimate is unlikely to occur in a reliability situation unless the measurement totally lacks any level of reliability.
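If you are curious what the ANOVA option is doing, here is a minimal Python sketch of the method for the balanced case (the same number of measurements on every patient). The function name and the two tiny data sets are hypothetical and are only there for illustration. They are not the chiropractic data, and this is not a claim about how SPSS implements the calculation internally.

```python
def one_way_variance_components(groups):
    """ANOVA (method of moments) variance components for a balanced
    one factor random effects model.

    groups: one list of repeated measurements per patient, all the same length.
    Returns (var_subj, var_error).
    """
    n = len(groups)        # number of patients
    k = len(groups[0])     # measurements per patient
    if k < 2 or any(len(g) != k for g in groups):
        raise ValueError("need a balanced design with 2+ measurements per patient")

    grand_mean = sum(sum(g) for g in groups) / (n * k)
    means = [sum(g) / k for g in groups]

    # Mean square between patients and mean square within patients.
    ms_between = k * sum((m - grand_mean) ** 2 for m in means) / (n - 1)
    ms_within = sum(
        (x - m) ** 2 for g, m in zip(groups, means) for x in g
    ) / (n * (k - 1))

    var_error = ms_within
    var_subj = (ms_between - ms_within) / k   # can come out negative
    return var_subj, var_error


# Hypothetical data: repeated measurements agree closely.
consistent = [[10, 12], [20, 18], [30, 33]]
# Hypothetical data: repeated measurements disagree wildly.
inconsistent = [[10, 30], [30, 10], [20, 20]]

print(one_way_variance_components(consistent))
print(one_way_variance_components(inconsistent))
```

In the second data set the patients' averages are identical, so all of the variation is within patients and the estimate of the between-patient variance drops below zero. That is the kind of negative estimate described above.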

Although SPSS does not directly produce an estimate of the intraclass correlation coefficient, the formula is Var(subj) / (Var(subj) + Var(Error)). In this case it works out to be 1738 / (1738 + 1150) = 0.60.
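If you want to check the arithmetic outside of SPSS, a few lines of Python will do it. The function name here is hypothetical. The two variance components are the ones quoted above.

```python
def intraclass_correlation(var_subj, *other_components):
    """Share of the total variance that is due to differences between subjects."""
    total = var_subj + sum(other_components)
    if total <= 0:
        raise ValueError("total variance must be positive")
    return var_subj / total


# Variance components from the text: Var(subj) = 1738, Var(Error) = 1150.
icc = intraclass_correlation(1738, 1150)
print(round(icc, 2))
```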

As a general rule of thumb (Shoukri and Edge 1996), a reliability coefficient (r) is

So this measurement has good reliability, but not excellent.

On your own

Three additional chiropractors evaluated patients on two separate occasions. Here are the data.

Estimate a separate reliability coefficient for each of the three chiropractors.

Estimating reliability from measurements by multiple observers.

The data shown above is interesting because you can re-organize it to represent another common method for reliability, having multiple observers provide a measurement for each patient.

Here is the data.

Here is a graph of this data.

This data represents a two factor random effects model, since you have to account for variability between raters and variability between subjects.

You have to specify a model without an interaction, because there is not enough data to estimate this interaction. Click on the MODEL button to do this.

You should specify a CUSTOM model, place the SUBJ and EV variables in the model, and avoid specifying an interaction. When you have done this, click on the CONTINUE button and then on the OK button in the previous dialog box.

The output from SPSS gives the following three variance components.

The intraclass correlation is computed as Var(subj) / (Var(subj) + Var(ev) + Var(Error)). In this example, it would be 1538 / (1538 + 170 + 1943) = 0.42.
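Here is a rough Python sketch of how the ANOVA method could produce three variance components like these when every evaluator measures every patient once and the model has no interaction. The function name and the small table are hypothetical. The table is not the chiropractic data, and the sketch assumes a complete, balanced table.

```python
def two_way_variance_components(table):
    """ANOVA variance components for a two factor random effects model
    without interaction. table[i][j] is the measurement of subject i by
    evaluator j, with exactly one measurement per cell.
    Returns (var_subj, var_ev, var_error).
    """
    n = len(table)       # subjects
    k = len(table[0])    # evaluators
    if n < 2 or k < 2 or any(len(row) != k for row in table):
        raise ValueError("need a complete table with 2+ subjects and 2+ evaluators")

    grand = sum(sum(row) for row in table) / (n * k)
    subj_means = [sum(row) / k for row in table]
    ev_means = [sum(table[i][j] for i in range(n)) / n for j in range(k)]

    ss_subj = k * sum((m - grand) ** 2 for m in subj_means)
    ss_ev = n * sum((m - grand) ** 2 for m in ev_means)
    ss_total = sum((x - grand) ** 2 for row in table for x in row)
    ss_error = ss_total - ss_subj - ss_ev

    ms_subj = ss_subj / (n - 1)
    ms_ev = ss_ev / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    var_error = ms_error
    var_subj = (ms_subj - ms_error) / k
    var_ev = (ms_ev - ms_error) / n
    return var_subj, var_ev, var_error


# Hypothetical table: 4 subjects (rows) measured by 3 evaluators (columns).
ratings = [
    [102, 110, 98],
    [150, 161, 149],
    [87, 95, 90],
    [130, 128, 121],
]
var_subj, var_ev, var_error = two_way_variance_components(ratings)
icc = var_subj / (var_subj + var_ev + var_error)
print(var_subj, var_ev, var_error, icc)
```

Notice that the evaluator variance sits in the denominator. If one evaluator consistently measures higher than the others, that counts against reliability under this definition.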

On your own

Here is the other half of the chiropractic reliability data. Calculate the reliability among these four observers.

This page was written by Steve Simon while working at Children's Mercy Hospital. Although I do not hold the copyright for this material, I am reproducing it here as a service, as it is no longer available on the Children's Mercy Hospital website. Need more information? I have a page with general help resources. You can also browse for pages similar to this one at Category: Measuring agreement.