Human Label Variation in Implicit Discourse Relation Recognition

Frances Yung, Daniil Ignatev, Merel Scholman, Vera Demberg, Massimo Poesio

TL;DR

Comparisons on Implicit Discourse Relation Recognition (IDRR) show that existing annotator-specific models perform poorly unless ambiguity is reduced, whereas models trained on label distributions yield more stable predictions.

Abstract

There is growing recognition that many NLP tasks lack a single ground truth, as human judgments reflect diverse perspectives. To capture this variation, models have been developed to predict full annotation distributions rather than majority labels, while perspectivist models aim to reproduce the interpretations of individual annotators. In this work, we compare these approaches on Implicit Discourse Relation Recognition (IDRR), a highly ambiguous task where disagreement often arises from cognitive complexity rather than ideological bias. Our experiments show that existing annotator-specific models perform poorly in IDRR unless ambiguity is reduced, whereas models trained on label distributions yield more stable predictions. Further analysis indicates that frequent cognitively demanding cases drive inconsistency in human interpretation, posing challenges for perspectivist modeling in IDRR.

Paper Structure (23 sections, 3 figures, 7 tables)

This paper contains 23 sections, 3 figures, 7 tables.

Figures (3)

  • Figure 1: Mean nPMI values between the worker and majority labels for the three clusters of workers with different levels of agreement with the majority: Cluster 0 (low agreement), Cluster 1 (medium agreement), Cluster 2 (high agreement).
  • Figure 2: Bias of four workers compared with the majority label, based on nPMI. Darker colors indicate greater divergence from the majority; blue marks a higher tendency (positive nPMI) and red a lower tendency (negative nPMI). Abbreviations: temporal, contingency, comparison, expansion, no relation.
  • Figure 3: Confusion matrices of worker-specific predictions for the above workers. Gold refers to the original labels. Abbreviations: temporal, contingency, comparison, expansion, no relation.
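
Figures 1 and 2 report normalized pointwise mutual information (nPMI) between a worker's labels and the majority labels. This section does not show how the paper computes it, so the following is only a minimal sketch. It assumes the standard definition, nPMI(x, y) = PMI(x, y) / -log p(x, y), with probabilities estimated from simple co-occurrence counts. The function name `npmi_table` and the toy labels are hypothetical and do not come from the paper.

```python
import math
from collections import Counter


def npmi_table(worker_labels, majority_labels):
    """Return {(worker_label, majority_label): nPMI} from paired label lists.

    Assumption: probabilities are plain relative frequencies over the items
    that the worker annotated; the paper may use a different estimator.
    """
    if not worker_labels or len(worker_labels) != len(majority_labels):
        raise ValueError("label lists must be non-empty and of equal length")

    n = len(worker_labels)
    joint = Counter(zip(worker_labels, majority_labels))
    worker_counts = Counter(worker_labels)
    majority_counts = Counter(majority_labels)

    table = {}
    for w, w_count in worker_counts.items():
        for m, m_count in majority_counts.items():
            pair_count = joint.get((w, m), 0)
            if pair_count == 0:
                # The pair never co-occurs: nPMI reaches its lower bound.
                table[(w, m)] = -1.0
                continue
            p_xy = pair_count / n
            if p_xy == 1.0:
                # Degenerate case with a single label pair; avoid dividing by zero.
                table[(w, m)] = 1.0
                continue
            pmi = math.log(p_xy / ((w_count / n) * (m_count / n)))
            table[(w, m)] = pmi / -math.log(p_xy)
    return table


# Toy usage with made-up labels: a worker who always matches the majority.
worker = ["temporal", "temporal", "contingency", "contingency"]
majority = ["temporal", "temporal", "contingency", "contingency"]
scores = npmi_table(worker, majority)
```

Under this definition, values near 1 mean the worker chooses a label almost exactly when the majority chooses the paired label, values near 0 mean the two are statistically independent, and -1 means the pair never co-occurs.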