VariErr NLI: Separating Annotation Error from Human Label Variation

Leon Weber-Genzel, Siyao Peng, Marie-Catherine de Marneffe, Barbara Plank

TL;DR

The paper addresses the coexistence of annotation errors and human label variation in NLP benchmarks by introducing VariErr, a dataset built with a two-round NLI annotation protocol: annotators first explain each label they assign, then judge the validity of the resulting label-explanation pairs. Grounded in ecologically valid explanations, this procedure makes it possible to distinguish errors from plausible variation, and it yields 1,933 label-explanation pairs and 7,732 validity judgments across 500 re-annotated MNLI items. An evaluation of automatic error detection (AED) methods (Datamaps, Metadata Archaeology), GPT-3.5 and GPT-4, and human heuristics shows that GPTs and humans outperform traditional AED methods; GPT-4 is the best system but still falls short of human performance. The results highlight the value of combining human judgments, explanations, and model-based signals to improve data quality, and the methodology extends to tasks beyond NLI in support of more trustworthy NLP systems.

Abstract

Human label variation arises when annotators assign different labels to the same item for valid reasons, while annotation errors occur when labels are assigned for invalid reasons. These two issues are prevalent in NLP benchmarks, yet existing research has studied them in isolation. To the best of our knowledge, there exists no prior work that focuses on teasing apart error from signal, especially in cases where signal is beyond black-and-white. To fill this gap, we introduce a systematic methodology and a new dataset, VariErr (variation versus error), focusing on the NLI task in English. We propose a 2-round annotation procedure with annotators explaining each label and subsequently judging the validity of label-explanation pairs. VariErr contains 7,732 validity judgments on 1,933 explanations for 500 re-annotated MNLI items. We assess the effectiveness of various automatic error detection (AED) methods and GPTs in uncovering errors versus human label variation. We find that state-of-the-art AED methods significantly underperform GPTs and humans. While GPT-4 is the best system, it still falls short of human performance. Our methodology is applicable beyond NLI, offering fertile ground for future research on error versus plausible variation, which in turn can yield better and more trustworthy NLP systems.
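
To make the two-round procedure concrete, here is a minimal Python sketch of how label-explanation pairs and their validity judgments might be represented. The reported counts are consistent with four judgments per explanation (1,933 × 4 = 7,732). Everything else is an assumption for illustration: the class and function names, the annotator IDs, the toy data, and in particular the decision rule (a label counts as plausible variation if at least one of its explanations is validated by its own author, and as an error otherwise). This is not the authors' released code or data format.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class LabelExplanation:
    """One round-1 annotation plus its round-2 validity judgments (hypothetical schema)."""
    annotator: str                      # who assigned the label and wrote the explanation
    label: str                          # "entailment", "neutral", or "contradiction"
    explanation: str
    judgments: Dict[str, bool] = field(default_factory=dict)  # judge id -> judged valid?

    def self_validated(self) -> bool:
        # True if the author later judged their own pair as valid.
        return self.judgments.get(self.annotator, False)

    def peer_validated(self) -> bool:
        # True if at least one other annotator judged the pair as valid.
        return any(ok for who, ok in self.judgments.items() if who != self.annotator)


def split_labels(pairs: List[LabelExplanation]) -> Tuple[Set[str], Set[str]]:
    """Assumed rule: a label is plausible if any of its explanations is
    self-validated; labels with no self-validated explanation are errors."""
    by_label: Dict[str, List[LabelExplanation]] = {}
    for pair in pairs:
        by_label.setdefault(pair.label, []).append(pair)
    plausible = {lab for lab, ps in by_label.items() if any(p.self_validated() for p in ps)}
    errors = set(by_label) - plausible
    return plausible, errors


# Toy item with three round-1 annotations, each judged by four annotators.
pairs = [
    LabelExplanation("A1", "entailment", "(toy explanation)",
                     {"A1": True, "A2": True, "A3": True, "A4": False}),
    LabelExplanation("A2", "neutral", "(toy explanation)",
                     {"A1": False, "A2": True, "A3": True, "A4": True}),
    LabelExplanation("A3", "contradiction", "(toy explanation)",
                     {"A1": False, "A2": False, "A3": False, "A4": False}),
]
plausible, errors = split_labels(pairs)
print(sorted(plausible), sorted(errors))
```

In this toy item, two labels survive as plausible variation and one is flagged as an error. Swapping `self_validated` for `peer_validated` inside `split_labels` would give a stricter or looser criterion depending on the data; which criterion the paper adopts should be checked against its dataset section.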

Paper Structure

This paper contains 36 sections, 1 equation, 6 figures, and 5 tables.

Figures (6)

  • Figure 1: Variation or Error? We present a procedure and multi-label dataset, VariErr, to tease apart annotation error from plausible human label variation. We leverage ecologically valid explanations and validation as two key mechanisms (boxed: self-validations; the label "Contradiction" is an error); see the paper's dataset and explanation-validation sections for details.
  • Figure 2: Frequency statistics on VariErr.
  • Figure 3: Correlations among scorer predictions.
  • Figure 4: Average distribution of erroneous, HLV, and other labels over the top 100 instances per method.
  • Figure 5: Frequency of NLI label sets on non-, self- and peer-validated label-explanation pairs.
  • ...and 1 more figure