
Consistency is Key: Disentangling Label Variation in Natural Language Processing with Intra-Annotator Agreement

Gavin Abercrombie, Tanvi Dinkar, Amanda Cercas Curry, Verena Rieser, Dirk Hovy

TL;DR

This paper addresses label variation in NLP by arguing that intra-annotator agreement should be reported routinely, alongside traditional inter-annotator metrics, to quantify label stability over time. It introduces the reliability-stability matrix to interpret how inter- and intra-annotator agreement jointly reflect task ambiguity, subjectivity, and data quality. A systematic review finds that intra-annotator agreement is rarely reported, with machine translation (MT) the notable exception. Exploratory longitudinal annotation experiments across four tasks show that annotators give inconsistent responses around 25% of the time. The work highlights the value of measuring intra-annotator agreement for quality control and for interpreting disagreements, while acknowledging the limitations of a small-scale study and the need for larger, more diverse investigations.

Abstract

We commonly use agreement measures to assess the utility of judgements made by human annotators in Natural Language Processing (NLP) tasks. While inter-annotator agreement is frequently used as an indication of label reliability by measuring consistency between annotators, we argue for the additional use of intra-annotator agreement to measure label stability (and annotator consistency) over time. However, in a systematic review, we find that the latter is rarely reported in this field. Calculating these measures can act as important quality control and could provide insights into why annotators disagree. We conduct exploratory annotation experiments to investigate the relationships between these measures and perceptions of subjectivity and ambiguity in text items, finding that annotators provide inconsistent responses around 25% of the time across four different NLP tasks.
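
To make the distinction between the two measures concrete, the following is a minimal sketch, not the authors' implementation. It assumes categorical labels, uses raw percent agreement and Cohen's kappa as stand-in metrics (the abstract does not say which coefficients the paper uses), and runs on invented toy labels. The names `percent_agreement`, `cohen_kappa`, `ann1_t1`, `ann1_t2`, and `ann2_t1` are all hypothetical.

```python
from collections import Counter


def percent_agreement(a, b):
    """Fraction of items that received the same label in both passes."""
    if len(a) != len(b) or not a:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohen_kappa(a, b):
    """Chance-corrected agreement between two label sequences."""
    n = len(a)
    p_o = percent_agreement(a, b)
    counts_a, counts_b = Counter(a), Counter(b)
    # Expected agreement if the two passes were statistically independent.
    p_e = sum(counts_a[l] * counts_b[l] for l in set(counts_a) | set(counts_b)) / (n * n)
    if p_e == 1.0:
        return 1.0  # both passes use a single identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Invented toy data (assumption): annotator 1 labels the same eight items
# at two points in time; annotator 2 labels them once.
ann1_t1 = ["pos", "neg", "pos", "neg", "pos", "neg", "pos", "neg"]
ann1_t2 = ["pos", "neg", "pos", "pos", "pos", "neg", "neg", "neg"]
ann2_t1 = ["pos", "neg", "neg", "neg", "pos", "pos", "neg", "neg"]

# Intra-annotator agreement: same annotator, different times (stability).
intra_raw = percent_agreement(ann1_t1, ann1_t2)
intra_kappa = cohen_kappa(ann1_t1, ann1_t2)

# Inter-annotator agreement: different annotators, same time (reliability).
inter_raw = percent_agreement(ann1_t1, ann2_t1)
inter_kappa = cohen_kappa(ann1_t1, ann2_t1)

print(f"intra: raw={intra_raw:.3f} kappa={intra_kappa:.3f}")
print(f"inter: raw={inter_raw:.3f} kappa={inter_kappa:.3f}")
```

In this toy data, annotator 1 changes their label on two of eight items between passes, which corresponds to the 25% inconsistency rate the abstract reports on average. Reporting both pairs of numbers side by side is what allows unstable labels to be distinguished from genuine disagreement between annotators.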

Paper Structure (21 sections, 1 figure, 8 tables)