StATS: What is a Kappa coefficient? (Cohen's Kappa)

When two binary variables are attempts by two individuals to measure the same thing, you can use Cohen's Kappa (often simply called Kappa) as a measure of agreement between the two individuals.

Kappa is based on the proportion of data values in the main diagonal of the table (the cases where the two raters agree) and then adjusts this proportion for the amount of agreement that could be expected due to chance alone.

Two raters are asked to classify objects into categories 1 and 2. The table below contains cell probabilities for a 2 by 2 table.

                          Second rater, category 1   Second rater, category 2   Total
First rater, category 1   p11                        p12                        p1+
First rater, category 2   p21                        p22                        p2+
Total                     p+1                        p+2                        1

Here p11, p12, p21, and p22 are the cell probabilities (the four cells sum to 1), and the "+" subscripts denote row and column totals; for example, p1+ = p11 + p12.

To compute Kappa, you first need to calculate the observed level of agreement,

$$P_O = p_{11} + p_{22}.$$

This value needs to be compared to the value that you would expect if the two raters were totally independent,

$$P_E = p_{1+}\,p_{+1} + p_{2+}\,p_{+2}.$$

The value of Kappa is defined as

$$\kappa = \frac{P_O - P_E}{1 - P_E}.$$
The numerator represents the discrepancy between the observed probability of agreement and the probability of agreement that you would expect in an extremely bad case, where the two raters are completely independent. Independence implies that the pair of raters agree about as often as two people who effectively flip coins to make their ratings.

The maximum value for Kappa occurs when the observed level of agreement is 1, which makes the numerator as large as the denominator. As the observed probability of agreement declines, the numerator declines. It is possible for Kappa to be negative, but this does not happen often. In such a case, you should interpret the value of Kappa to mean that there is no effective agreement between the two raters.
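These formulas translate into just a few lines of code. Here is a minimal sketch in Python; the function name and the list-of-lists layout are just illustrations, not taken from any particular package.

```python
def cohens_kappa(p):
    """Compute Cohen's Kappa from a 2x2 table of cell probabilities.

    p[i][j] is the probability that the first rater uses category i+1
    and the second rater uses category j+1; the four cells sum to 1.
    """
    # Observed agreement: the main diagonal of the table.
    p_o = p[0][0] + p[1][1]

    # Row and column totals (marginal probabilities).
    row1, row2 = p[0][0] + p[0][1], p[1][0] + p[1][1]
    col1, col2 = p[0][0] + p[1][0], p[0][1] + p[1][1]

    # Agreement expected by chance alone, assuming independent raters.
    p_e = row1 * col1 + row2 * col2

    return (p_o - p_e) / (1 - p_e)
```

Applied to the table of the first example below (written as proportions, [[0.045, 0.112], [0.106, 0.738]]), this function returns roughly 0.16.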

How to interpret Kappa

Kappa is always less than or equal to 1. A value of 1 implies perfect agreement and values less than 1 imply less than perfect agreement.

In rare situations, Kappa can be negative. This is a sign that the two observers agreed less than would be expected just by chance.

It is rare that we get perfect agreement. Different people have different interpretations as to what counts as a good level of agreement. Below is one interpretation, taken from page 404 of Altman DG (1991), Practical Statistics for Medical Research. London: Chapman and Hall.

Here is one possible interpretation of Kappa.

    Below 0.20      Poor agreement
    0.21 to 0.40    Fair agreement
    0.41 to 0.60    Moderate agreement
    0.61 to 0.80    Good agreement
    0.81 to 1.00    Very good agreement

An example of Kappa

In an examination of self-reported prescription use and prescription use estimated by electronic medical records, the following table of cell percentages was observed (one measure in the rows, the other in the columns).

                Use      No use
    Use         4.5%     11.2%
    No use     10.6%     73.8%

The value for Kappa is 0.16, indicating a poor level of agreement.
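You can check this value from the table itself. The observed agreement is $P_O = 0.045 + 0.738 = 0.783$, the marginal totals are about 0.157 and 0.844 for the rows and 0.151 and 0.850 for the columns, so the chance agreement is $P_E \approx (0.157)(0.151) + (0.844)(0.850) \approx 0.741$, and $\kappa \approx (0.783 - 0.741)/(1 - 0.741) \approx 0.16$.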

A second example of Kappa

The following table represents the diagnosis of biopsies from 40 patients with self-reported malignant melanoma. The rows represent the first pathologist's diagnosis and the columns represent the second pathologist's diagnosis. Compute Kappa.

Again, this is only a fair level of agreement. Notice that even though the pathologists agree 70% of the time, they would be expected to have almost as large a level of agreement (62%) just by chance alone.
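Working through the formula with these figures, $\kappa = (0.70 - 0.62)/(1 - 0.62) = 0.08/0.38 \approx 0.21$.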

Using SPSS to compute Kappa

As before, select ANALYZE | DESCRIPTIVE STATISTICS | CROSSTABS from the SPSS menu. In the dialog box, click on the STATISTICS button and then select the Kappa option box.

At the bottom of the page is what the SPSS output would look like.
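If you want to double-check the SPSS result, or do not have SPSS handy, the same statistic is available elsewhere. Here is a small sketch using Python and the scikit-learn library, which is my own suggestion rather than something from SPSS; the two lists of ratings are made up for illustration.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical data: each list holds one rater's classification (1 or 2)
# of the same ten objects.
rater_a = [1, 1, 2, 2, 1, 2, 1, 1, 2, 2]
rater_b = [1, 2, 2, 2, 1, 2, 1, 1, 1, 2]

# cohen_kappa_score applies the same chance correction described above.
print(cohen_kappa_score(rater_a, rater_b))  # about 0.6 for these made-up ratings
```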

Further reading

I have a lot of references for kappa and the intraclass correlation coefficient that I need to sort through.

Here's an interesting question related to this topic: Bill asks how to determine if a sample size is adequate for estimating an intraclass correlation.

The simplest approach is to see if the confidence interval that you have produced (or will produce) is sufficiently narrow to meet your needs. The confidence interval formulas are messy, but if you want to pursue this further, Shoukri and Edge have a book that may help.
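If you want a rough answer without wrestling with those messy formulas, one option (my substitution, not something proposed by Bill or by Shoukri and Edge) is to bootstrap the interval: compute the one-way intraclass correlation, resample subjects with replacement, and look at the spread of the resampled estimates. Here is a minimal Python sketch, assuming a table of ratings with one row per subject and one column per rater.

```python
import numpy as np

def icc_1_1(ratings):
    """One-way random-effects ICC(1,1) for an (n_subjects, n_raters) array."""
    ratings = np.asarray(ratings, dtype=float)
    n, k = ratings.shape
    subject_means = ratings.mean(axis=1)
    grand_mean = ratings.mean()
    # Between-subject and within-subject mean squares from one-way ANOVA.
    msb = k * np.sum((subject_means - grand_mean) ** 2) / (n - 1)
    msw = np.sum((ratings - subject_means[:, None]) ** 2) / (n * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)

def bootstrap_icc_ci(ratings, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for ICC(1,1), resampling subjects."""
    rng = np.random.default_rng(seed)
    ratings = np.asarray(ratings, dtype=float)
    n = ratings.shape[0]
    boot = [icc_1_1(ratings[rng.integers(0, n, n)]) for _ in range(n_boot)]
    return np.quantile(boot, [alpha / 2, 1 - alpha / 2])
```

If the resulting interval is wider than you can live with, then the sample size is not adequate for your purposes.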

Nico van Duijn published a nice bibliography for this topic on the Evidence Based Health listserver (subscribe at listserv@mailbase.ac.uk and send messages to evidence-based-health@mailbase.ac.uk). I will draw from this bibliography to write my page.

Another good reference, specifically about Kappa, is www.hassey.demon.co.uk/kappa.rtf, which requires a word processor that can read RTF (Rich Text Format) files.

http://ourworld.compuserve.com/homepages/jsuebersax/agree.htm discusses measures of agreement. This author criticizes kappa.

Here's an email that might form the basis for an Ask Professor Mean question.

I met with you at the start of my dissertation and found your advice very helpful. I am in the process of finishing up my data and have a quick question that I thought you might be able to help with. I did behavioral observations for my study, and had one person code all the data, and another person code 20% of the data for reliability. I would like to use the Kappa equation to determine the reliability between my coders. I know I need to calculate four numbers: 1) total number of agreements that the behavior occurred; 2) total number of agreements that the behavior did not occur; 3) number of times coder A said yes and coder B said no, and 4) number of times coder A said no and coder B said yes. My question is what do I do with those numbers to get a Kappa score? I know SPSS will do it if I enter all the data--but that would be hundreds of data points per subject, and would take much longer than calculating it by hand. Any information you could provide would be greatly appreciated. Thanks! Rebecca
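For what it's worth, those four numbers are all you need, and the hand calculation is short enough to script. Here is a sketch in Python; the variable names and the example counts are made up for illustration.

```python
def kappa_from_counts(a, b, c, d):
    """Cohen's Kappa from the four cell counts of a 2x2 agreement table.

    a: both coders said the behavior occurred
    b: coder A said yes, coder B said no
    c: coder A said no, coder B said yes
    d: both coders said the behavior did not occur
    """
    n = a + b + c + d
    p_o = (a + d) / n                                       # observed agreement
    p_e = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2  # chance agreement
    return (p_o - p_e) / (1 - p_e)

# Made-up example: 40 agreements on "occurred", 30 on "did not occur",
# and 5 disagreements in each direction.
print(kappa_from_counts(40, 5, 5, 30))  # roughly 0.75
```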

  1. Cohen J. A coefficient of agreement for nominal scales. Educat Psychol Measure 1960; 20: 37-46.
  2. Bartko JJ. Intraclass correlation coefficient as a measure of reliability. Psychol Reports 1966; 19: 3-11.
  3. Cohen J. Weighted kappa: nominal scale agreement with provision for scaled disagreement or partial credit. Psychol Bull 1968; 70: 213-20.
  4. Bartko JJ. On various intraclass correlation reliability coefficients. Psychol Bull 1976; 83: 762-5.
  5. Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics 1977; 33: 159-74.
  6. Kramer MS, Feinstein AR. Clinical biostatistics LIV: the biostatistics of concordance. Clin Pharmacol Ther 1981; 29: 111-23.
  7. Brennan P, Silman A. Statistical methods for assessing observer variability in clinical measures. Brit Med J 1992; 304: 1491-4.
  8. Thompson WD, Walter SD. A reappraisal of the kappa coefficient. J Clin Epidemiol 1988; 41: 949-58.
  9. Byrt T, Bishop J, Carlin JB. Bias, prevalence and kappa. J Clin Epidemiol 1993; 46: 423.
  10. Streiner DL, Norman GR. Health Measurement Scales, 4th ed. Oxford: Oxford University Press, 1994. One interpretation of kappa given there is: less than 0.40 poor, 0.40-0.59 fair, 0.60-0.74 good, greater than 0.74 excellent.
  11. Cicchetti DV. Guidelines, criteria, and rules of thumb for evaluating normed and standardized assessment instruments in psychology. Psychological Assessment 1994; 6: 284-90.
  12. Fleiss JL, Nee JCM, Landis JR. Large sample variance of kappa in the case of different sets of raters. Psychol Bull 1979; 86: 974-7.
  13. Feinstein AR, Cicchetti DV. High agreement but low kappa: the problems of two paradoxes. J Clin Epidemiol 1990; 43: 543-9.
  14. Cicchetti DV, Feinstein AR. High agreement but low kappa: resolving the paradoxes. J Clin Epidemiol 1990; 43: 551-8.
  15. Thompson WD, Walter SD. Kappa and the concept of independent errors. J Clin Epidemiol 1988; 41: 969-70.
  16. Lantz CA, Nebenzahl E. Behavior and interpretation of the kappa statistic: resolution of the two paradoxes. J Clin Epidemiol 1996; 49: 431.

This page was written by Steve Simon while working at Children's Mercy Hospital. Although I do not hold the copyright for this material, I am reproducing it here as a service, as it is no longer available on the Children's Mercy Hospital website. Need more information? I have a page with general help resources. You can also browse for pages similar to this one at Category: Definitions, Category: Measuring agreement.