Julius Sim, Chris C Wright, The Kappa Statistic in Reliability Studies: Use, Interpretation, and Sample Size Requirements, Physical Therapy, Volume 85, Issue 3, 1 March 2005, Pages 257–268, https://doi.org/10.1093/ptj/85.3.257
Purpose. This article examines and illustrates the use and interpretation of the kappa statistic in musculoskeletal research. Summary of Key Points. The reliability of clinicians' ratings is an important consideration in areas such as diagnosis and the interpretation of examination findings. Often, these ratings lie on a nominal or an ordinal scale. For such data, the kappa coefficient is an appropriate measure of reliability. Kappa is defined, in both weighted and unweighted forms, and its use is illustrated with examples from musculoskeletal research. Factors that can influence the magnitude of kappa (prevalence, bias, and nonindependent ratings) are discussed, and ways of evaluating the magnitude of an obtained kappa are considered. The issue of statistical testing of kappa is considered, including the use of confidence intervals, and appropriate sample sizes for reliability studies using kappa are tabulated. Conclusions. The article concludes with recommendations for the use and interpretation of kappa.
In musculoskeletal practice and research, there is frequently a need to determine the reliability of measurements made by clinicians—reliability here being the extent to which clinicians agree in their ratings, not merely the extent to which their ratings are associated or correlated. Defined as such, 2 types of reliability exist: (1) agreement between ratings made by 2 or more clinicians (interrater reliability) and (2) agreement between ratings made by the same clinician on 2 or more occasions (intrarater reliability).
In some cases, the ratings in question are on a continuous scale, such as joint range of motion or distance walked in 6 minutes. In other instances, however, clinicians' judgments are in relation to discrete categories. These categories may be nominal (eg, “present,” “absent”) or ordinal (eg, “mild,” “moderate,” “severe”); in each case, the categories are mutually exclusive and collectively exhaustive, so that each case falls into one, and only one, category. A number of recent studies have used such data to examine interrater or intrarater reliability in relation to: clinical diagnoses or classifications, 1– 4 assessment findings, 5– 9 and radiographic signs. 10– 12 These data require specific statistical methods to assess reliability, and the kappa (κ) statistic is commonly used for this purpose. This article will define and illustrate the kappa coefficient and will examine some potentially problematic issues connected with its use and interpretation. Sample size requirements, which previously were not readily available in the literature, also are provided.
A common example of a situation in which a researcher may want to assess agreement on a nominal scale is to determine the presence or absence of some disease or condition. This agreement could be determined in situations in which 2 researchers or clinicians have used the same examination tool or different tools to determine the diagnosis. One way of gauging the agreement between 2 clinicians is to calculate overall percentage of agreement (calculated over all paired ratings) or effective percentage of agreement (calculated over those paired ratings where at least one clinician diagnoses presence of the disease). 13 Although these calculations provide a measure of agreement, neither takes into account the agreement that would be expected purely by chance. If clinicians agree purely by chance, they are not really “agreeing” at all; only agreement beyond that expected by chance can be considered “true” agreement. Kappa is such a measure of “true” agreement. 14 It indicates the proportion of agreement beyond that expected by chance, that is, the achieved beyond-chance agreement as a proportion of the possible beyond-chance agreement. 15 It takes the form:
$$\kappa = \frac{\text{achieved agreement beyond chance}}{\text{potential agreement beyond chance}}$$

In terms of symbols, this is:

$$\kappa = \frac{P_{\mathrm{o}} - P_{\mathrm{c}}}{1 - P_{\mathrm{c}}}$$

where Po is the proportion of observed agreements and Pc is the proportion of agreements expected by chance. The simplest use of kappa is for the situation in which 2 clinicians each provide a single rating of the same patient, or where a clinician provides 2 ratings of the same patient, representing interrater and intrarater reliability, respectively. Kappa also can be adapted for more than one rating per patient from each of 2 clinicians, 16, 17 or for situations where more than 2 clinicians rate each patient or where each clinician may not rate every patient. 18 In this article, however, our focus will be on the simple situation where 2 raters give an independent single rating for each patient or where a single rater provides 2 ratings for each patient. Here, the concern is with how well these ratings agree, not with their relationship with some “gold standard” or “true” diagnosis. 19
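As an illustration (ours, not part of the original article), the formula can be computed directly from the two proportions; the function name below is invented for this sketch.

```python
def kappa_from_proportions(p_o: float, p_c: float) -> float:
    """Kappa from the proportion of observed agreement (p_o) and the
    proportion of agreement expected by chance (p_c)."""
    return (p_o - p_c) / (1 - p_c)

# Hypothetical example: observed agreement .80, chance agreement .50
print(round(kappa_from_proportions(0.80, 0.50), 2))  # 0.6
```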
If used and interpreted appropriately, the kappa coefficient provides valuable information on the reliability of diagnostic and other examination procedures.
The data for paired ratings on a 2-category nominal scale are usually displayed in a 2 × 2 contingency table, with the notation indicated in Table 1. 20 This table shows data from 2 clinicians who assessed 39 patients in relation to the relevance of lateral shift, according to the McKenzie method of low back pain assessment. 9 Cells a and d indicate, respectively, the numbers of patients for whom both clinicians agree on the relevance or nonrelevance of lateral shift. Cells b and c indicate the numbers of patients on whom the clinicians disagree. For clinician 2, the total numbers of patients in whom lateral shift was deemed relevant or not relevant are given in the marginal totals, f1 and f2, respectively. The corresponding marginal totals for clinician 1 are g1 and g2.
Table 1. Diagnostic Assessments of Relevance of Lateral Shift by 2 Clinicians, From Kilpikoski et al 9 (κ=.67) a

|  | Clinician 2: Relevant | Clinician 2: Not relevant | Total |
| --- | --- | --- | --- |
| Clinician 1: Relevant | a 22 | b 2 | g1 24 |
| Clinician 1: Not relevant | c 4 | d 11 | g2 15 |
| Total | f1 26 | f2 13 | n 39 |
The letters in the upper left-hand corners of the cells indicate the notation used for a 2 × 2 contingency table. The main diagonal cells (a and d) represent agreement, and the off-diagonal cells (b and c) represent disagreement.
Summing the frequencies in the main diagonal cells (cells a and d) gives the frequency of observed agreement. Dividing by n gives the proportion of observed agreement. Thus, the proportion of observed agreement in Table 1 is:
$$P_{\mathrm{o}} = \frac{(a + d)}{n} = \frac{22 + 11}{39} = .8462$$

The proportion of expected agreement is based on the assumption that assessments are independent between clinicians. Therefore, the frequency of chance agreement for relevance and nonrelevance of lateral shift is calculated by multiplying the marginal totals corresponding to each cell on the main diagonal and dividing by n. Summing across chance agreement in these cells and dividing by n gives the proportion of expected agreement. For the data in Table 1, this is:

$$P_{\mathrm{c}} = \frac{\dfrac{g_1 f_1}{n} + \dfrac{g_2 f_2}{n}}{n} = \frac{\dfrac{24 \times 26}{39} + \dfrac{15 \times 13}{39}}{39} = \frac{16.0 + 5.0}{39} = .5385$$
Substituting into the formula:

$$\kappa = \frac{P_{\mathrm{o}} - P_{\mathrm{c}}}{1 - P_{\mathrm{c}}} = \frac{.8462 - .5385}{1 - .5385} = .67$$

The range of possible values of kappa is from −1 to 1, though it usually falls between 0 and 1. Unity represents perfect agreement, indicating that the raters agree in their classification of every case. Zero indicates agreement no better than that expected by chance, as if the raters had simply “guessed” every rating. A negative kappa would indicate agreement worse than that expected by chance. 21 However, this rarely occurs in clinical contexts, and, when it does, the magnitude of the negative coefficient is usually small (theoretically a value of −1 can be attained if 2 raters are being considered, though with more than 2 raters the possible minimum value will be higher). 22 The kappa coefficient does not itself indicate whether disagreement is due to random differences (ie, those due to chance) or systematic differences (ie, those due to a consistent pattern) between the clinicians' ratings, 23 and the data should be examined accordingly. The Figure shows the relationship of kappa to overall and chance agreement schematically. 24
Figure. Schematic representation of the relationship of kappa to overall and chance agreement. Kappa=C/D. Adapted from Rigby. 24
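The full calculation from the four cell counts can be sketched in Python as follows (our own illustration, not code from the article); it reproduces the values derived above for the Table 1 data (Po=.8462, Pc=.5385, κ=.67).

```python
def kappa_2x2(a: int, b: int, c: int, d: int):
    """Unweighted kappa for a 2 x 2 table, with agreement cells a, d and
    disagreement cells b, c (notation as in Table 1)."""
    n = a + b + c + d
    p_o = (a + d) / n                     # proportion of observed agreement
    g1, g2 = a + b, c + d                 # clinician 1 (row) totals
    f1, f2 = a + c, b + d                 # clinician 2 (column) totals
    p_c = (g1 * f1 + g2 * f2) / n ** 2    # proportion of chance agreement
    return p_o, p_c, (p_o - p_c) / (1 - p_c)

p_o, p_c, kappa = kappa_2x2(a=22, b=2, c=4, d=11)     # Table 1 counts
print(round(p_o, 4), round(p_c, 4), round(kappa, 2))  # 0.8462 0.5385 0.67
```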
The kappa coefficient can be used for scales with more than 2 categories. Richards et al 12 assessed intraobserver and interobserver agreement of radiographic classification of scoliosis in relation to the King classification system. The King system is a multicategory nominal scale by means of which radiographs of the spine can be classified into 1 of 5 types of spinal curve. However, many multicategory scales are ordinal, and in such cases it is important to retain the hierarchical nature of the categories.
Table 2 presents the results of a hypothetical reliability study of assessments of movement-related pain, on 2 occasions by a single examiner, during which time pain would not have been expected to change. The assessment categories were “no pain,” “mild pain,” “moderate pain,” and “severe pain.” These categories are clearly ordinal, in that they reflect increasing levels of movement-related pain. Here, disagreement by 1 scale point (eg, “no pain”–“mild pain”) is less serious than disagreement by 2 scale points (eg, “no pain”–“moderate pain”). To reflect the degree of disagreement, kappa can be weighted, so that it attaches greater emphasis to large differences between ratings than to small differences. A number of methods of weighting are available, 25 but quadratic weighting is common ( Appendix). Weighted kappa penalizes disagreements in terms of their seriousness, whereas unweighted kappa treats all disagreements equally. Unweighted kappa, therefore, is inappropriate for ordinal scales. 26 Because in this example most disagreements are of only a single category, the quadratic weighted kappa (.67) is higher than the unweighted kappa (.55). Different weighting schemes will produce different values of weighted kappa on the same data; for example, linear weighting gives a kappa of .61 for the data in Table 2.
Table 2. Test-Retest Agreement of Ratings of Movement-Related Pain at the Shoulder Joint (Hypothetical Data) a

|  | Test 2: No pain | Test 2: Mild pain | Test 2: Moderate pain | Test 2: Severe pain | Total |
| --- | --- | --- | --- | --- | --- |
| Test 1: No pain | 15 (1) [1] | 3 (.67) [.89] | 1 (.33) [.56] | 1 (0) [0] | 20 |
| Test 1: Mild pain | 4 (.67) [.89] | 18 (1) [1] | 3 (.67) [.89] | 2 (.33) [.56] | 27 |
| Test 1: Moderate pain | 4 (.33) [.56] | 5 (.67) [.89] | 16 (1) [1] | 4 (.67) [.89] | 29 |
| Test 1: Severe pain | 1 (0) [0] | 2 (.33) [.56] | 4 (.67) [.89] | 17 (1) [1] | 24 |
| Total | 24 | 28 | 24 | 24 | 100 |
Figures in parentheses are linear kappa weights; figures in brackets are quadratic kappa weights. Unweighted κ=.55; linear weighted κ=.61; quadratic weighted κ=.67.
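The weighted calculation can be sketched in Python as follows (our own illustration, not code from the article); the linear and quadratic weight formulas used here match the weights shown in parentheses and brackets in Table 2, and the sketch reproduces the quoted values of .55, .61, and .67.

```python
def weighted_kappa(table, scheme="unweighted"):
    """Kappa for a square k x k table of counts, with 'unweighted', 'linear',
    or 'quadratic' agreement weights."""
    k = len(table)
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(table[i][j] for i in range(k)) for j in range(k)]

    def w(i, j):
        if scheme == "unweighted":
            return 1.0 if i == j else 0.0
        if scheme == "linear":
            return 1.0 - abs(i - j) / (k - 1)
        return 1.0 - (i - j) ** 2 / (k - 1) ** 2   # quadratic

    p_o = sum(w(i, j) * table[i][j] / n for i in range(k) for j in range(k))
    p_c = sum(w(i, j) * row_tot[i] * col_tot[j] / n ** 2
              for i in range(k) for j in range(k))
    return (p_o - p_c) / (1 - p_c)

# Table 2 counts: rows are test 1, columns are test 2
pain = [[15, 3, 1, 1],
        [4, 18, 3, 2],
        [4, 5, 16, 4],
        [1, 2, 4, 17]]
for scheme in ("unweighted", "linear", "quadratic"):
    print(scheme, round(weighted_kappa(pain, scheme), 2))  # .55, .61, .67
```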
Such weightings also can be applied to a nominal scale with 3 or more categories, if certain disagreements are considered more serious than others. Table 3 shows data for the agreement between 2 raters on the presence of a derangement, dysfunction, or postural syndrome, in terms of the classification of spinal pain originally proposed by McKenzie. 27 The value of kappa for these data is .46. Normally, in the calculation of kappa, the agreement cells (cells a, e, and i) would be given a weighting of unity, and the remaining disagreement cells would be given a weighting of zero ( Appendix). If it were felt, however, that a disagreement between a dysfunctional syndrome and a postural syndrome is of less concern clinically than a disagreement between a derangement syndrome and a dysfunctional syndrome, or between a derangement syndrome and a postural syndrome, this could be represented by applying a linear weighting to the cell frequencies. Accordingly, cells h and f would have a weight of .5, while the weights for cells b, c, d, and g would remain at zero. With this weighting, the value of kappa becomes .50. Because 16 disagreements (cells h and f) of the total of 36 disagreements are now treated as less serious through the linear weighting, kappa has increased.
Table 3. Interrater Agreement of Ratings of Spinal Pain (Hypothetical Data) a

|  | Clinician 2: Derangement syndrome | Clinician 2: Dysfunctional syndrome | Clinician 2: Postural syndrome | Total |
| --- | --- | --- | --- | --- |
| Clinician 1: Derangement syndrome | a 22 | b 10 | c 2 | 34 |
| Clinician 1: Dysfunctional syndrome | d 6 | e 27 | f 11 | 44 |
| Clinician 1: Postural syndrome | g 2 | h 5 | i 17 | 24 |
| Total | 30 | 42 | 30 | 102 |
Unweighted κ=.46; cells b and d weighted as agreement κ=.50; cells f and h weighted as agreement κ=.55.
For a nominal scale with more than 2 categories, the obtained value of kappa does not identify individual categories on which there may be either high or low agreement. 28 The use of weighting also may serve to determine the sources of disagreement between raters on a nominal scale with more than 2 categories and the effect of these disagreements on the values of kappa. 29 A cell representing a particular disagreement can be assigned a weight representing agreement (unity), effectively treating this source of disagreement as an agreement, while leaving unchanged the weights for remaining sources of disagreement. The alteration that this produces in the value of kappa serves to quantify the effect of the identified disagreement on the overall agreement, and all possible sources of disagreement can be compared in this way. Returning to the data in Table 3, if we weight as agreements those instances in which the raters disagreed between derangement and dysfunction syndromes (cells b and d), kappa rises from .46 without weighting to .50 with weighting. If alternatively we apply agreement weighting to disagreements between dysfunctional and postural syndromes (cells f and h), kappa rises more markedly to .55. As the disagreement between dysfunctional and postural syndromes produces the greater increase in kappa, it can be seen to contribute more to the overall disagreement than that between derangement and dysfunctional syndromes. This finding might indicate that differences between postural and dysfunctional syndromes are more difficult to determine than differences between derangement and dysfunctional syndromes. This information might lead to retraining of the raters or rewording of examination protocols.
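This cell-by-cell reweighting can be sketched with an explicit weight matrix (again our own illustration, not code from the article); for the Table 3 counts it reproduces the values of .46, .50, and .55 quoted above.

```python
def kappa_explicit_weights(table, w):
    """Kappa for a square table of counts with an explicit weight matrix w,
    where 1 denotes full agreement and 0 full disagreement."""
    k = len(table)
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(table[i][j] for i in range(k)) for j in range(k)]
    p_o = sum(w[i][j] * table[i][j] / n for i in range(k) for j in range(k))
    p_c = sum(w[i][j] * row_tot[i] * col_tot[j] / n ** 2
              for i in range(k) for j in range(k))
    return (p_o - p_c) / (1 - p_c)

# Table 3 counts: rows are clinician 1, columns are clinician 2
# (derangement, dysfunctional, postural syndromes)
spinal = [[22, 10, 2],
          [6, 27, 11],
          [2, 5, 17]]
unweighted = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
cells_b_d_as_agreement = [[1, 1, 0], [1, 1, 0], [0, 0, 1]]
cells_f_h_as_agreement = [[1, 0, 0], [0, 1, 1], [0, 1, 1]]
for w in (unweighted, cells_b_d_as_agreement, cells_f_h_as_agreement):
    print(f"{kappa_explicit_weights(spinal, w):.2f}")  # 0.46, 0.50, 0.55
```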
In theory, kappa can be applied to ordinal categories derived from continuous data. For example, joint ranges of motion, measured in degrees, could be placed into 4 categories: “unrestricted,” “slightly restricted,” “moderately restricted,” and “highly restricted.” However, the results from such an analysis will depend largely on the choice of the category limits. As this choice is in many cases arbitrary, the value of kappa produced may have little meaning. Furthermore, this procedure involves needless sacrifice of information in the original scale and will normally give rise to a loss of statistical power. 30, 31 It is far preferable to analyze the reliability of data obtained with the original continuous scale 32 using other methods such as the intraclass correlation coefficient, 33 the standard error of measurement, 34 or the bias and limits of agreement. 35
As previously noted, the magnitude of the kappa coefficient represents the proportion of agreement greater than that expected by chance. The interpretation of the coefficient, however, is not so straightforward, as there are other factors that can influence the magnitude of the coefficient or the interpretation that can be placed on a given magnitude. Among those factors that can influence the magnitude of kappa are prevalence, bias, and nonindependence of ratings.
The kappa coefficient is influenced by the prevalence of the attribute (eg, a disease or clinical sign). For a situation in which raters choose between classifying cases as either positive or negative in respect to such an attribute, a prevalence effect exists when the proportion of agreements on the positive classification differs from that of the negative classification. This can be expressed by the prevalence index. Using the notation from Table 1, this is:
$$\text{prevalence index} = \frac{|a - d|}{n}$$

where |a−d| is the absolute value of the difference between the frequencies of these cells (ie, ignoring the sign) and n is the number of paired ratings.
If the prevalence index is high (ie, the prevalence of a positive rating is either very high or very low), chance agreement is also high and kappa is reduced accordingly. 29 This can be shown by considering further data from Kilpikoski et al 9 on the presence or absence of lateral shift ( Tab. 4). In Table 4A, the prevalence index is high:
$$\text{prevalence index} = \frac{|28 - 2|}{39} = .67$$

Table 4. (A) Assessment of the Presence of Lateral Shift, From Kilpikoski et al 9 (κ=.18); (B) the Same Data Adjusted to Give Equal Agreements in Cells a and d, and Thus a Low Prevalence Index (κ=.54)

(A)

|  | Clinician 2: Present | Clinician 2: Absent | Total |
| --- | --- | --- | --- |
| Clinician 1: Present | a 28 | b 3 | 31 |
| Clinician 1: Absent | c 6 | d 2 | 8 |
| Total | 34 | 5 | 39 |

(B)

|  | Clinician 2: Present | Clinician 2: Absent | Total |
| --- | --- | --- | --- |
| Clinician 1: Present | a 15 | b 3 | 18 |
| Clinician 1: Absent | c 6 | d 15 | 21 |
| Total | 21 | 18 | 39 |
The proportion of chance agreement, therefore, also is relatively high (.72), and the value of kappa is .18. In Table 4B, however, there is a lower prevalence index of zero. Although the raters agree on the same number of cases (30) as in Table 4A, the low prevalence index reduces chance agreement to .50, and the value of kappa accordingly rises to .54. From Table 4B, prevalence index = |15 − 15|/39 = 0, Po = (15 + 15)/39 = .7692, and Pc = [(21 × 18)/39 + (18 × 21)/39]/39 = .4970. Thus,
$$\kappa = \frac{P_{\mathrm{o}} - P_{\mathrm{c}}}{1 - P_{\mathrm{c}}} = \frac{.7692 - .4970}{1 - .4970} = .54$$

This illustrates the first of 2 paradoxes 20 : when there is a large prevalence index, kappa is lower than when the prevalence index is low or zero. The effect of prevalence on kappa is greater for large values of kappa than for small values. 36
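A short Python sketch of the prevalence index and its effect on chance agreement and kappa (ours, not from the article), using the Table 4 counts:

```python
def prevalence_index_and_kappa(a, b, c, d):
    """Prevalence index, chance agreement, and unweighted kappa for a
    2 x 2 table (notation as in Table 1)."""
    n = a + b + c + d
    prevalence_index = abs(a - d) / n
    p_o = (a + d) / n
    p_c = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2
    return prevalence_index, p_c, (p_o - p_c) / (1 - p_c)

# Table 4A (high prevalence index) and Table 4B (prevalence index of zero)
for cells in ((28, 3, 6, 2), (15, 3, 6, 15)):
    prev, p_c, kappa = prevalence_index_and_kappa(*cells)
    print(f"prevalence index {prev:.2f}, chance agreement {p_c:.2f}, kappa {kappa:.2f}")
# prevalence index 0.67, chance agreement 0.72, kappa 0.18
# prevalence index 0.00, chance agreement 0.50, kappa 0.54
```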
Bannerjee and Fielding 37 suggest that it is the true prevalence in the population that affects the magnitude of kappa. This is not wholly accurate, as the prevalence index does not provide a direct indication of the true prevalence of the disease. Rather, if a disease is either very common or very rare, this will predispose clinicians to diagnose or not to diagnose it, respectively, so that the prevalence index provides only an indirect indication of true prevalence, mediated by the clinicians' diagnostic behavior.
Because the magnitude of kappa is affected by the prevalence of the attribute, kappa on its own is difficult to interpret meaningfully unless the prevalence index is taken into account.
Bias is the extent to which the raters disagree on the proportion of positive (or negative) cases and is reflected in a difference between cells b and c in Table 1. The bias index is:
$$\text{bias index} = \frac{|b - c|}{n}$$

Bias affects our interpretation of the magnitude of the coefficient. Table 5 shows hypothetical data for 2 clinicians' diagnosis of spondylolisthesis in 100 patients. In both Table 5A and Table 5B, the proportion of cases on which the raters agree is the same, at .56, but the pattern of disagreements differs between the 2 tables because each clinician rates a differing proportion of cases as positive. In Table 5A, the proportions of cases rated as positive are .50 and .52 for clinicians 1 and 2, respectively, whereas the corresponding proportions in Table 5B are .35 and .67. In Table 5A, disagreement is close to symmetrical. The bias index is accordingly low:
$$\text{bias index} = \frac{|23 - 21|}{100} = .02$$

Table 5. (A) Contingency Table Showing Nearly Symmetrical Disagreements in Cells b and c, and Thus a Low Bias Index (κ=.12); (B) Contingency Table With Asymmetrical Disagreements in Cells b and c, and Thus a Higher Bias Index (κ=.20) a

(A)

|  | Clinician 2: Present | Clinician 2: Absent | Total |
| --- | --- | --- | --- |
| Clinician 1: Present | a 29 | b 21 | 50 |
| Clinician 1: Absent | c 23 | d 27 | 50 |
| Total | 52 | 48 | n 100 |

(B)

|  | Clinician 2: Present | Clinician 2: Absent | Total |
| --- | --- | --- | --- |
| Clinician 1: Present | a 29 | b 6 | 35 |
| Clinician 1: Absent | c 38 | d 27 | 65 |
| Total | 67 | 33 | n 100 |
Hypothetical data for diagnoses of spondylolisthesis (“present” or “absent”) by 2 clinicians.
In contrast, in Table 5B the disagreements are asymmetrical. There is, therefore, a much higher bias index in Table 5B:
$$\text{bias index} = \frac{|38 - 6|}{100} = .32$$

Owing to the much greater bias in Table 5B than in Table 5A, the resulting kappa coefficients are different (.20 and .12, respectively). This gives rise to the second paradox 20 : when there is a large bias, kappa is higher than when bias is low or absent. In contrast to prevalence, the effect of bias is greater when kappa is small than when it is large. 36 Just as with prevalence, the magnitude of kappa should be interpreted in the light of the bias index.
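The bias index can be sketched in the same way (again our own illustration), using the Table 5 counts:

```python
def bias_index_and_kappa(a, b, c, d):
    """Bias index and unweighted kappa for a 2 x 2 table (notation as in Table 1)."""
    n = a + b + c + d
    bias_index = abs(b - c) / n
    p_o = (a + d) / n
    p_c = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2
    return bias_index, (p_o - p_c) / (1 - p_c)

# Table 5A (nearly symmetrical disagreements) and Table 5B (asymmetrical disagreements)
for cells in ((29, 21, 23, 27), (29, 6, 38, 27)):
    bias, kappa = bias_index_and_kappa(*cells)
    print(f"bias index {bias:.2f}, kappa {kappa:.2f}")
# bias index 0.02, kappa 0.12
# bias index 0.32, kappa 0.20
```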
An important assumption underlying the use of the kappa coefficient is that errors associated with clinicians' ratings are independent. 38– 40 This requires the patients or subjects to be independent (so that any individual can contribute only one paired rating) and ratings to be independent (so that each observer should generate a rating without knowledge, and thus without influence, of the other observer's rating). 40 The fact that ratings are related in the sense of pertaining to the same case, however, does not contravene the assumption of independence.
The kappa coefficient, therefore, is not appropriate for a situation in which one observer is required to either confirm or disconfirm a known previous rating from another observer. In such a situation, agreement on the underlying attribute is contaminated by agreement on the assessment of that attribute, and the magnitude of kappa is liable to be inflated. Equally, as with all measures of intratester reliability, ratings on the first testing may sometimes influence those given on the second occasion, which will threaten the assumption of independence. In this way, apparent agreement may reflect more a recollection of the previous decision than a genuine judgment as to the appropriate classification. In a situation in which the clinician is doubtful as to the appropriate classification, this recollection may sway the decision in favor of agreement rather than disagreement with the previous decision. Thus, “agreements” that represent a decision to classify in the same way will be added to agreements on the actual attribute. This will tend to increase the value of kappa. 38
Accordingly, studies of either interrater or intrarater reliability should be designed in such a way that ratings are, as far as possible, independent, otherwise kappa values may be inappropriately inflated. Equally, where a study appears not to have preserved independence between ratings, kappa should be interpreted cautiously. Strictly, there will always be some degree of dependence between ratings in an intrarater study. 38 Various strategies can be used, however, to minimize this dependence. The time interval between repeat ratings is important. If the interval is too short, the rater might remember the previously recorded rating; if the interval is too long, then the attribute under examination might have changed. Streiner and Norman 33 stated that an interval of 2 to 14 days is usual, but this will depend on the attribute being measured. Stability of the attribute being rated is crucial to the period between repeated ratings. Thus, trait attributes pose fewer problems for intrarater assessment (because longer periods of time may be left between ratings) than state attributes, which are more labile. Some suggestions to overcome the bias due to memory include: having as long a time period as possible between repeat examinations, blinding raters to their first rating (although this might be easier with numerical data than with diagnostic categories), and different random ordering of patients or subjects on each rating occasion and for each rater.
Because both prevalence and bias play a part in determining the magnitude of the kappa coefficient, some statisticians have devised adjustments to take account of these influences. 36 Kappa can be adjusted for high or low prevalence by computing the average of cells a and d and substituting this value for the actual values in those cells. Similarly, an adjustment for bias is achieved by substituting the mean of cells b and c for those actual cell values. The kappa coefficient that results is referred to as PABAK (prevalence-adjusted bias-adjusted kappa). Table 6A shows data from Kilpikoski et al 9 for assessments of directional preference (ie, the direction of movement that reduces or abolishes pain) in patients evaluated according to the McKenzie system; kappa for these data is .54. When the cell frequencies are adjusted to minimize prevalence and bias, this gives the cell values shown in Table 6B, with a PABAK of .79.
Table 6. (A) Data Reported by Kilpikoski et al 9 for Judgments of Directional Preference by 2 Clinicians (κ=.54); (B) Cell Frequencies Adjusted to Minimize Prevalence and Bias Effects, Giving a Prevalence-Adjusted Bias-Adjusted κ of .79

(A)

|  | Clinician 2: Present | Clinician 2: Absent | Total |
| --- | --- | --- | --- |
| Clinician 1: Present | a 32 | b 1 | 33 |
| Clinician 1: Absent | c 3 | d 3 | 6 |
| Total | 35 | 4 | 39 |

(B)

|  | Clinician 2: Present | Clinician 2: Absent | Total |
| --- | --- | --- | --- |
| Clinician 1: Present | a 18 | b 2 | 20 |
| Clinician 1: Absent | c 2 | d 17 | 19 |
| Total | 20 | 19 | 39 |
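A minimal sketch (ours) of the PABAK adjustment described above, averaging cells a and d and cells b and c before recomputing kappa; for the Table 6A counts it reproduces the quoted values of .54 and .79.

```python
def kappa_2x2(a, b, c, d):
    """Unweighted kappa for a 2 x 2 table (notation as in Table 1)."""
    n = a + b + c + d
    p_o = (a + d) / n
    p_c = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2
    return (p_o - p_c) / (1 - p_c)

def pabak(a, b, c, d):
    """Prevalence-adjusted bias-adjusted kappa: replace cells a and d by their
    mean and cells b and c by their mean, then compute kappa as usual.
    For a 2 x 2 table this equals 2 * p_o - 1."""
    return kappa_2x2((a + d) / 2, (b + c) / 2, (b + c) / 2, (a + d) / 2)

# Table 6A: judgments of directional preference
print(f"kappa {kappa_2x2(32, 1, 3, 3):.2f}, PABAK {pabak(32, 1, 3, 3):.2f}")
# kappa 0.54, PABAK 0.79
```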
Hoehler 41 is critical of the use of PABAK because he believes that the effects of bias and prevalence on the magnitude of kappa are themselves informative and should not be adjusted for and thereby disregarded. Thus, the PABAK could be considered to generate a value for kappa that does not relate to the situation in which the original ratings were made. Table 6B represents very different diagnostic behavior from Table 6A, as indicated by the change in the marginal totals. Furthermore, all of the frequencies in the cells have changed between Table 6A and Table 6B.
Therefore, the PABAK coefficient on its own is uninformative because it relates to a hypothetical situation in which no prevalence or bias effects are present. However, if PABAK is presented in addition to, rather than in place of, the obtained value of kappa, its use may be considered appropriate because it gives an indication of the likely effects of prevalence and bias alongside the true value of kappa derived from the specific measurement context studied. Cicchetti and Feinstein 42 argued, in a similar vein to Hoehler, 41 that the effects of the prevalence and bias “penalize” the value of kappa in an appropriate manner. However, they also stated that a single “omnibus” value of kappa is difficult to interpret, especially when trying to diagnose the possible cause of an apparent lack of agreement. Byrt et al 36 recommended that the prevalence index and bias index should be given alongside kappa, and other authors 42, 43 have suggested that the separate proportions of positive and negative agreements should be quoted as a means of alerting the reader to the possibility of prevalence or bias effects. Similarly, Gjørup 44 suggested that kappa values should be accompanied by the original data in a contingency table.
Landis and Koch 45 have proposed the following as standards for strength of agreement for the kappa coefficient: ≤0=poor, .01–.20=slight, .21–.40=fair, .41–.60=moderate, .61–.80=substantial, and .81–1=almost perfect. Similar formulations exist, 46– 48 but with slightly different descriptors. The choice of such benchmarks, however, is inevitably arbitrary, 29, 49 and the effects of prevalence and bias on kappa must be considered when judging its magnitude. In addition, the magnitude of kappa is influenced by factors such as the weighting applied and the number of categories in the measurement scale. 32, 49– 51 When weighted kappa is used, the choice of weighting scheme will affect its magnitude ( Appendix). The larger the number of scale categories, the greater the potential for disagreement, with the result that unweighted kappa will be lower with many categories than with few. 32 If quadratic weighting is used, however, kappa increases with the number of categories, and this is most marked in the range from 2 to 5 categories. 50 For linear weighting, kappa varies much less with the number of categories than for quadratic weighting, and may increase or decrease with the number of categories, depending on the distribution of the underlying trait. 50 Caution, therefore, should be exercised when comparing the magnitude of kappa across variables that have different prevalence or bias or that are measured on dissimilar scales or across situations in which different weighting schemes have been applied to kappa.
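Purely as a convenience, and subject to the caveats above about the arbitrariness of such benchmarks, the Landis and Koch descriptors can be encoded as a simple lookup (our own sketch):

```python
def landis_koch_label(kappa):
    """Map a kappa value to the Landis and Koch strength-of-agreement descriptor."""
    if kappa <= 0:
        return "poor"
    for upper, label in ((0.20, "slight"), (0.40, "fair"), (0.60, "moderate"),
                         (0.80, "substantial")):
        if kappa <= upper:
            return label
    return "almost perfect"

print(landis_koch_label(0.67))  # substantial
```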
Dunn 49 suggested that interpretation of kappa is assisted by also reporting the maximum value it could attain for the set of data concerned. To calculate the maximum attainable kappa (κmax), the proportions of positive and negative judgments by each clinician (ie, the marginal totals) are taken as fixed, and the distribution of paired ratings (ie, the cell frequencies a, b, c, and d in Tab. 1) is then adjusted so as to represent the greatest possible agreement. Table 7 illustrates this process, using data from a study of therapists' examination of passive cervical intervertebral motion. 7 Ratings were on a 2-point scale (ie, “stiffness”/“no stiffness”). Table 7 shows that clinician 1 judged stiffness to be present in 3 subjects, whereas clinician 2 arrived at a figure of 9. Thus, the maximum possible agreement on stiffness is limited to 3 subjects, rather than the actual figure of 2. Similarly, clinician 1 judged 57 subjects to have no stiffness, compared with 51 subjects judged by clinician 2 to have no stiffness; therefore, for “no stiffness,” the maximum agreement possible is 51 subjects, rather than 50. That is, the maximum possible agreement for either presence or absence of the disease is the smaller of the marginal totals in each case. The remaining 6 ratings (60 − [3 + 51] = 6) are allocated to the cells that represent disagreement, in order to maintain the marginal total; thus, these ratings are allocated to cell c. For these data, κmax is .46, compared with a kappa of .28.
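A sketch (ours) of κmax for a 2 × 2 table, using cell counts inferred from the figures quoted above for the Smedmark et al data (a=2, b=1, c=7, d=50); it reproduces κ=.28 and κmax=.46.

```python
def kappa_and_kappa_max(a, b, c, d):
    """Unweighted kappa and the maximum attainable kappa (kappa_max) given the
    fixed marginal totals of a 2 x 2 table (notation as in Table 1)."""
    n = a + b + c + d
    g1, g2 = a + b, c + d          # clinician 1 (row) totals
    f1, f2 = a + c, b + d          # clinician 2 (column) totals
    p_c = (g1 * f1 + g2 * f2) / n ** 2
    p_o = (a + d) / n
    p_o_max = (min(g1, f1) + min(g2, f2)) / n   # best agreement the marginals allow
    return (p_o - p_c) / (1 - p_c), (p_o_max - p_c) / (1 - p_c)

# Cell counts inferred from the figures quoted in the text for Smedmark et al
k, k_max = kappa_and_kappa_max(a=2, b=1, c=7, d=50)
print(f"kappa {k:.2f}, kappa max {k_max:.2f}")  # kappa 0.28, kappa max 0.46
```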
Table 7. Data on Assessments of Stiffness at C1-2, From Smedmark et al 7 a