Assessing agreement on classification tasks: the kappa statistic

Jean Carletta

TL;DR

The paper critiques the reliability measures currently used in discourse and dialogue coding for being neither easily interpretable nor comparable across studies, and for being sensitive to category structure. It advocates adopting the kappa statistic from content analysis to provide a chance-corrected, interpretable measure of intercoder reliability, with guidance on interpretation and comparability. It also discusses the role of naive versus expert coders and proposes practical adjustments, arguing that standardized reliability metrics will enhance cross-study comparability and the overall robustness of discourse research.

Abstract

Currently, computational linguists and cognitive scientists working in the area of discourse and dialogue argue that their subjective judgments are reliable using several different statistics, none of which are easily interpretable or comparable to each other. Meanwhile, researchers in content analysis have already experienced the same difficulties and come up with a solution in the kappa statistic. We discuss what is wrong with reliability measures as they are currently used for discourse and dialogue work in computational linguistics and cognitive science, and argue that we would be better off as a field adopting techniques from content analysis.
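The kappa statistic at issue is a chance-corrected agreement measure, K = (P(A) - P(E)) / (1 - P(E)), where P(A) is the proportion of times the coders agree and P(E) is the proportion of agreement expected by chance. The sketch below illustrates the two-coder (Cohen) case in Python; the function name, the toy dialogue-move labels, and the use of each coder's marginal category frequencies to estimate chance agreement are illustrative assumptions rather than details taken from the paper, which discusses the general multi-coder formulation from the content-analysis literature.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two coders.

    K = (P(A) - P(E)) / (1 - P(E)), where P(A) is the observed
    proportion of agreement and P(E) is the agreement expected by
    chance, estimated here from each coder's marginal category
    frequencies (illustrative two-coder sketch only).
    """
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)

    # Observed agreement: proportion of items both coders labelled identically.
    p_a = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: sum over categories of the product of the two marginals.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    p_e = sum((freq_a[c] / n) * (freq_b[c] / n)
              for c in freq_a.keys() | freq_b.keys())

    return (p_a - p_e) / (1 - p_e)

# Hypothetical example: two coders assign dialogue-move categories to eight utterances.
coder_1 = ["instruct", "check", "instruct", "align", "check", "instruct", "align", "check"]
coder_2 = ["instruct", "check", "instruct", "check", "check", "instruct", "align", "align"]
print(f"kappa = {cohen_kappa(coder_1, coder_2):.2f}")
```

For this toy example the coders agree on 6 of 8 items, so P(A) = 0.75; the marginal frequencies give P(E) of roughly 0.34, and the resulting kappa is about 0.62, noticeably lower than raw agreement once chance is factored out.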
