Three related concepts: Interobserver Agreement, Interobserver Correlation, and Interobserver Reliability Coefficient

Relationship Between Interobserver Agreement, Interobserver Correlation, and Interobserver Reliability Coefficient

Percent Agreement | Interobserver Reliability Correlation (r) | Interobserver Reliability Coefficient (r²) | Example of Data With This Degree of Interobserver Agreement (scores for Participants 1-10, in order)

100% | 1 | 1 | Observer 1: 0 1 0 1 0 1 0 1 0 1; Observer 2: 0 1 0 1 0 1 0 1 0 1
90% | 0.8 | 0.64 | Observer 1: 0 1 0 1 0 1 0 1 0 1; Observer 2: 0 1 0 1 0 1 0 1 0 0
80% | 0.6 | 0.36 | Observer 1: 0 1 0 1 0 1 0 1 0 1; Observer 2: 0 1 0 1 0 1 0 1 1 0
70% | 0.4 | 0.17 | Observer 1: 0 1 0 1 0 1 0 1 0 1; Observer 2: 0 1 0 1 0 1 0 0 1 0
60% | 0.2 | 0.04 | Observer 1: 0 1 0 1 0 1 0 1 0 1; Observer 2: 0 1 0 1 0 1 1 0 1 0
50% | 0 | 0 | Observer 1: 0 1 0 1 0 1 0 1 0 1; Observer 2: 0 1 0 1 0 0 1 0 1 0
40% | -0.2 | 0.04 | Observer 1: 0 1 0 1 0 1 0 1 0 1; Observer 2: 0 1 0 1 1 0 1 0 1 0
30% | -0.4 | 0.17 | Observer 1: 0 1 0 1 0 1 0 1 0 1; Observer 2: 0 1 0 0 1 0 1 0 1 0
20% | -0.6 | 0.36 | Observer 1: 0 1 0 1 0 1 0 1 0 1; Observer 2: 0 1 1 0 1 0 1 0 1 0
10% | -0.8 | 0.64 | Observer 1: 0 1 0 1 0 1 0 1 0 1; Observer 2: 0 0 1 0 1 0 1 0 1 0
0% | -1 | 1 | Observer 1: 1 0 1 0 1 0 1 0 1 0; Observer 2: 0 1 0 1 0 1 0 1 0 1

Imagine that observers (raters) are judging whether comments are sexist (1) or not (0). Note that if observers made their judgments by flipping a coin, they would still be expected to agree about 50% of the time, and the interobserver reliability coefficient would be about 0. Thus, 50% agreement would not be good reliability. Indeed, even 80% agreement would result in an interobserver reliability coefficient of only .36, which most social scientists would consider poor reliability.
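The three statistics can be computed directly from the two observers' scores. The following is a minimal sketch in Python, using the scores from the 80% agreement example; the function names `percent_agreement` and `pearson_r` are illustrative helpers written for this example, not part of the original material.

```python
from math import sqrt

def percent_agreement(a, b):
    """Percentage of cases on which the two observers gave the same score."""
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return 100.0 * matches / len(a)

def pearson_r(a, b):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    ss_a = sum((x - mean_a) ** 2 for x in a)
    ss_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(ss_a * ss_b)

# Scores for Participants 1-10 in the 80% agreement example.
observer_1 = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
observer_2 = [0, 1, 0, 1, 0, 1, 0, 1, 1, 0]

agreement = percent_agreement(observer_1, observer_2)
r = pearson_r(observer_1, observer_2)   # interobserver reliability correlation
r_squared = r ** 2                      # interobserver reliability coefficient

print(f"agreement = {agreement:.0f}%, r = {r:.2f}, r^2 = {r_squared:.2f}")
# prints: agreement = 80%, r = 0.60, r^2 = 0.36
```

Swapping in the scores from any other example shows how quickly the coefficient falls as agreement drops toward 50%.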

What if the observers agreed substantially less than 50% of the time? Then the observers would be systematically disagreeing rather than merely failing to agree. To illustrate, look at the case in which observers agreed 0% of the time. Whenever one observer judges a comment to be sexist, the other always judges it to be nonsexist. The fact that the observers are making opposite ratings is reflected in the negative correlation between their ratings (r = -1). Note, however, that this complete disagreement results in the same interobserver reliability coefficient as perfect agreement: 1. This happens because the coefficient is the square of the correlation, so the sign of the correlation is lost. Although it seems strange that the interobserver reliability coefficient for raters who disagree all the time is the same as the coefficient for raters who agree all the time, it makes some sense because, in both cases, observers’ ratings are not differing by chance.
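To see how squaring hides the direction of the relationship, the sketch below compares an observer who always agrees with Observer 1 against one who always disagrees, using the scores from the 0% agreement example. The helper `pearson_r` and the variable names are assumptions made for illustration.

```python
from math import sqrt

def pearson_r(a, b):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    ss_a = sum((x - mean_a) ** 2 for x in a)
    ss_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(ss_a * ss_b)

# Observer 1's scores from the 0% agreement example.
observer_1 = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]

always_agrees = list(observer_1)                        # 100% agreement
always_disagrees = [1 - score for score in observer_1]  # 0% agreement

r_agree = pearson_r(observer_1, always_agrees)
r_disagree = pearson_r(observer_1, always_disagrees)

print(r_agree, r_agree ** 2)        # prints: 1.0 1.0
print(r_disagree, r_disagree ** 2)  # prints: -1.0 1.0
```

The correlations have opposite signs, but the squared values are identical, which is why the coefficient alone cannot distinguish consistent agreement from consistent disagreement.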

The fact that raters who consistently disagree can produce an interobserver reliability coefficient just as high as raters who consistently agree is usually not a problem, because raters usually agree; the question is usually just how consistently they agree. Consequently, you will usually get a positive correlation, not a negative correlation, between observers’ ratings. If you were to obtain a negative correlation, it might be because raters misunderstood the scale (e.g., on a 1-5 scale, one rater thinks that “5” means “good” whereas the other thinks that “5” means “poor”). In short, if you have a negative correlation between raters, you (a) have a problem and (b) should probably not report the interobserver reliability coefficient (because readers would assume that your coefficient was based on a positive correlation).
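One way to act on this advice is to check the sign of the correlation before squaring it. The sketch below uses made-up ratings on a 1-5 scale (they are hypothetical, not data from the table) in which Rater B appears to have read the scale backward; recoding B's ratings as 6 minus the rating reverses the scale. Whether recoding is appropriate in a real study depends on confirming that the rater actually misread the scale.

```python
from math import sqrt

def pearson_r(a, b):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    ss_a = sum((x - mean_a) ** 2 for x in a)
    ss_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(ss_a * ss_b)

# Hypothetical 1-5 ratings; Rater B seems to treat "5" as "poor".
rater_a = [5, 4, 2, 1, 3, 4]
rater_b = [1, 2, 5, 5, 3, 2]

r_before = pearson_r(rater_a, rater_b)
if r_before < 0:
    print("Negative correlation: check whether a rater reversed the scale.")

# Reverse-score Rater B (on a 1-5 scale, 6 - rating flips the direction).
rater_b_recoded = [6 - rating for rating in rater_b]
r_after = pearson_r(rater_a, rater_b_recoded)

print(f"r before recoding = {r_before:.2f}, after recoding = {r_after:.2f}")
```

Recoding changes only the sign of the correlation, so the squared coefficient is the same before and after; the point of the check is to make sure the reported coefficient rests on a positive correlation.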

 

 



