Human Label Variation in Implicit Discourse Relation Recognition

Frances Yung; Daniil Ignatev; Merel Scholman; Vera Demberg; Massimo Poesio

TL;DR

Comparisons on Implicit Discourse Relation Recognition (IDRR) show that existing annotator-specific models perform poorly unless ambiguity is reduced, whereas models trained on label distributions yield more stable predictions.

Abstract

There is growing recognition that many NLP tasks lack a single ground truth, as human judgments reflect diverse perspectives. To capture this variation, models have been developed to predict full annotation distributions rather than majority labels, while perspectivist models aim to reproduce the interpretations of individual annotators. In this work, we compare these approaches on Implicit Discourse Relation Recognition (IDRR), a highly ambiguous task where disagreement often arises from cognitive complexity rather than ideological bias. Our experiments show that existing annotator-specific models perform poorly in IDRR unless ambiguity is reduced, whereas models trained on label distributions yield more stable predictions. Further analysis indicates that frequent cognitively demanding cases drive inconsistency in human interpretation, posing challenges for perspectivist modeling in IDRR.

Paper Structure (23 sections, 3 figures, 7 tables)

Introduction
Related work
Disagreement and perspectives in annotation
Annotator-specific label prediction models
Perspectives in IDRR
Experiments
Data
Models
Single truth (ST) model
Soft label models
Perspectivist models
Results
Single-label prediction
Label-distribution prediction
Annotator-specific label prediction
...and 8 more sections

Figures (3)

Figure 1: Mean nPMI values between the worker and majority labels for the three clusters of workers with different levels of agreement with the majority: Cluster 0 (low agreement), Cluster 1 (medium agreement), Cluster 2 (high agreement).
Figure 2: Bias of four workers compared with the majority label based on nPMI. Darker colors mean more divergence from the majority, where blue means higher tendency (positive nPMI) and red means lower tendency (negative nPMI). Abbreviations: temporal, contingency, comparison, expansion, no relation.
Figure 3: Confusion matrices of worker-specific prediction of the above workers. Gold refers to the original labels. Abbreviations: temporal, contingency, comparison, expansion, no relation.
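
The captions above rely on normalized pointwise mutual information (nPMI) between a worker's labels and the majority labels. As a rough illustration only, the sketch below computes nPMI with the standard definition, nPMI(x, y) = log(p(x, y) / (p(x) p(y))) / (-log p(x, y)), estimated from co-occurrence counts. The function name `npmi_table`, the input format, and the toy labels are assumptions made for this example; the paper's exact computation may differ.

```python
import math
from collections import Counter


def npmi_table(worker_labels, majority_labels):
    """Return {(worker_label, majority_label): nPMI} for observed pairs.

    Hypothetical helper: both inputs are equal-length lists holding, per item,
    the label chosen by one worker and the majority label for that item.
    Pairs that never co-occur are omitted (their nPMI tends to -1).
    """
    if not worker_labels or len(worker_labels) != len(majority_labels):
        raise ValueError("inputs must be non-empty and of equal length")

    n = len(worker_labels)
    joint = Counter(zip(worker_labels, majority_labels))
    worker_counts = Counter(worker_labels)
    majority_counts = Counter(majority_labels)

    table = {}
    for (w, m), count in joint.items():
        p_xy = count / n
        if p_xy == 1.0:
            # Only one pair exists, so -log p(x, y) = 0 and nPMI is undefined;
            # this sketch reports 1.0 (complete co-occurrence) by convention.
            table[(w, m)] = 1.0
            continue
        p_x = worker_counts[w] / n
        p_y = majority_counts[m] / n
        pmi = math.log(p_xy / (p_x * p_y))
        table[(w, m)] = pmi / -math.log(p_xy)
    return table


# Toy data (invented for illustration, not taken from the paper).
worker = ["temporal", "temporal", "expansion", "expansion"]
majority = ["temporal", "temporal", "expansion", "expansion"]
scores = npmi_table(worker, majority)
```

Under this definition, a positive value means the worker picks a label together with a given majority label more often than chance would predict, a negative value means less often, and 0 means the two are independent, which matches the blue/red reading described for Figure 2.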
