Architectural Sweet Spots for Modeling Human Label Variation by the Example of Argument Quality: It's Best to Relate Perspectives!

Philipp Heinisch; Matthias Orlikowski; Julia Romberg; Philipp Cimiano

TL;DR

This paper tackles subjectivity in annotation by modeling both shared grounding and individual annotator perspectives in argument quality tasks. It defines a spectrum of architectures, from annotator-agnostic majority labeling to annotator-specific heads and recommender-inspired models that relate annotator behavior, including ShareREC and SepREC. Empirically, recommender-based designs yield the strongest annotator-level performance on two datasets (Concreteness and ValNov), with gains up to $43\%$ over majority baselines and notable improvements in hard-case labeling. The work provides a practical framework for incorporating human label variation into subjective NLP tasks, enabling more nuanced and user-adaptable argument-quality classifiers while highlighting trade-offs between shared representations and annotator-specific nuances.

Abstract

Many annotation tasks in natural language processing are highly subjective in that there can be different valid and justified perspectives on what is a proper label for a given example. This also applies to the judgment of argument quality, where the assignment of a single ground truth is often questionable. At the same time, there are generally accepted concepts behind argumentation that form a common ground. To best represent the interplay of individual and shared perspectives, we consider a continuum of approaches ranging from models that fully aggregate perspectives into a majority label to "share nothing"-architectures in which each annotator is considered in isolation from all other annotators. In between these extremes, inspired by models used in the field of recommender systems, we investigate the extent to which architectures that include layers to model the relations between different annotators are beneficial for predicting single-annotator labels. By means of two tasks of argument quality classification (argument concreteness and validity/novelty of conclusions), we show that recommender architectures increase the averaged annotator-individual F$_1$-scores up to $43\%$ over a majority label model. Our findings indicate that approaches to subjectivity can benefit from relating individual perspectives.
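To make the comparison in the abstract concrete, the following is a minimal sketch of how an averaged annotator-individual F$_1$ score could be computed for a majority-label baseline. It is not the authors' evaluation code: the toy annotations, the helper names (`majority_label`, `macro_f1`, `averaged_annotator_f1`), the use of macro-averaged F$_1$ per annotator, and the tie-breaking rule are all assumptions made for illustration.

```python
from collections import Counter


def majority_label(labels):
    """Most frequent label; ties go to the smallest label (arbitrary assumption)."""
    counts = Counter(labels)
    top = max(counts.values())
    return min(label for label, c in counts.items() if c == top)


def f1_for_class(gold, pred, positive):
    """F1 for one class, treating `positive` as the target label."""
    pairs = list(zip(gold, pred))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over all labels seen in gold or pred."""
    classes = sorted(set(gold) | set(pred))
    return sum(f1_for_class(gold, pred, c) for c in classes) / len(classes)


def averaged_annotator_f1(annotations, predict):
    """Score each annotator against their own labels, then average.

    annotations: {annotator: {item: label}}
    predict: callable (annotator, item) -> label
    """
    scores = []
    for annotator, item_labels in annotations.items():
        items = sorted(item_labels)
        gold = [item_labels[i] for i in items]
        pred = [predict(annotator, i) for i in items]
        scores.append(macro_f1(gold, pred))
    return sum(scores) / len(scores)


# Hypothetical toy data: three annotators, three items, binary labels.
annotations = {
    "a1": {"x1": 1, "x2": 0, "x3": 1},
    "a2": {"x1": 1, "x2": 0, "x3": 0},
    "a3": {"x1": 0, "x2": 0, "x3": 1},
}
items = sorted({i for labels in annotations.values() for i in labels})
majority = {
    i: majority_label([labels[i] for labels in annotations.values() if i in labels])
    for i in items
}

# Annotator-agnostic baseline: every annotator receives the majority label.
majority_score = averaged_annotator_f1(annotations, lambda a, i: majority[i])
# Upper bound: a model that reproduces each annotator's own labels.
oracle_score = averaged_annotator_f1(annotations, lambda a, i: annotations[a][i])
print(round(majority_score, 4), oracle_score)
```

In this toy setting the majority baseline is penalized on every item where an annotator deviates from the majority, which is the gap that annotator-aware architectures aim to close.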

Paper Structure (25 sections, 1 equation, 1 figure, 8 tables)

Introduction
Related Work
Subjectivity & Modeling Individual Annotators
Subjectivity in Argument Mining
Methodology
Two poles: annotator-specific and annotator-agnostic approaches
Approaches between annotator-specific and annotator-agnostic approaches
Annotator-specific classification head
Recommender-system inspired models
Experiment Design
Datasets
CIMT Argument Concreteness Dataset (abbr. Concreteness; Romberg et al., 2022)
Argument Validity and Novelty Prediction Shared Task (abbr. ValNov; Heinisch et al., 2022)
Experimental Setup
Evaluation Metrics
...and 10 more sections

Figures (1)

Figure 1: Overview of our different approaches modeling annotator-(a)gnostic behaviors.
