LeWiDi-2025 at NLPerspectives: The Third Edition of the Learning with Disagreements Shared Task

Elisa Leonardelli, Silvia Casola, Siyao Peng, Giulia Rizzi, Valerio Basile, Elisabetta Fersini, Diego Frassinelli, Hyewon Jang, Maja Pavlovic, Barbara Plank, Massimo Poesio

TL;DR

LeWiDi-2025 addresses the modeling of disagreement in NLP by extending the benchmark to four text-based tasks (CSC, MP, VEN, Par) and introducing two evaluation paradigms: soft-label prediction and perspectivist prediction. It adds ordinal labeling and new metrics (Manhattan and Wasserstein distances for soft labels; AER and ANAD for perspectivist judgments), enabling robust evaluation across multiclass, multilabel, and ordinal settings. The study shows that unified, annotator-aware pipelines, which often leverage annotator demonstrations and demographic information, achieve strong performance: LLM-based systems excel in many cases, while fine-tuned transformers remain competitive. Overall, the work provides new resources, benchmarks, and insights to advance disagreement-aware NLP, highlighting both methodological gains and remaining challenges in annotator generalization and cross-task transfer.

Abstract

Many researchers have reached the conclusion that AI models should be trained to be aware of the possibility of variation and disagreement in human judgments, and evaluated on their ability to recognize such variation. The LEWIDI series of shared tasks on Learning With Disagreements was established to promote this approach to training and evaluating AI models, by making suitable datasets more accessible and by developing evaluation methods. The third edition of the task builds on this goal by extending the LEWIDI benchmark to four datasets spanning paraphrase identification, irony detection, sarcasm detection, and natural language inference, with labeling schemes that include not only categorical judgments, as in previous editions, but ordinal judgments as well. Another novelty is that we adopt two complementary paradigms to evaluate disagreement-aware systems: the soft-label approach, in which models predict population-level distributions of judgments, and the perspectivist approach, in which models predict the interpretations of individual annotators. Crucially, we move beyond standard metrics such as cross-entropy and test new evaluation metrics for the two paradigms. The task attracted diverse participation, and the results provide insights into the strengths and limitations of methods for modeling variation. Together, these contributions strengthen LEWIDI as a framework and provide new resources, benchmarks, and findings to support the development of disagreement-aware technologies.
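
To make the soft-label paradigm concrete, the following minimal sketch builds a soft label from individual annotator judgments and compares it with a predicted distribution using the two distances named in the TL;DR. It is an illustration only, not the task's official scorer: the function names, the 1 to 5 rating scale, the unit spacing between ordinal labels, and the example annotations are all assumptions made here, and the actual evaluation may normalize or average scores differently.

```python
from collections import Counter


def soft_label(annotations, labels):
    """Convert individual annotator judgments into a distribution over `labels`."""
    counts = Counter(annotations)
    total = len(annotations)
    return [counts[label] / total for label in labels]


def manhattan_distance(p, q):
    """Sum of absolute per-label differences between two distributions."""
    return sum(abs(a - b) for a, b in zip(p, q))


def wasserstein_distance(p, q):
    """1-D Wasserstein distance for ordinal labels with unit spacing (assumed).

    Computed as the sum of absolute differences between the two
    cumulative distributions.
    """
    distance = 0.0
    cdf_gap = 0.0
    for a, b in zip(p, q):
        cdf_gap += a - b
        distance += abs(cdf_gap)
    return distance


# Hypothetical item rated by four annotators on an assumed 1-5 ordinal scale.
labels = [1, 2, 3, 4, 5]
target = soft_label([1, 1, 2, 5], labels)

near_miss = [0.25, 0.5, 0.0, 0.0, 0.25]   # mass shifted to an adjacent rating
far_miss = [0.25, 0.25, 0.0, 0.0, 0.5]    # same amount of mass shifted across the scale

for name, prediction in [("near", near_miss), ("far", far_miss)]:
    print(name,
          manhattan_distance(target, prediction),
          wasserstein_distance(target, prediction))
```

In this example both predictions misplace the same amount of probability mass, so the Manhattan distance cannot tell them apart, whereas the Wasserstein distance penalizes the prediction that moves mass further along the ordinal scale. This is one reason a distance that respects label order is a natural fit for ordinal judgments.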

Paper Structure

This paper contains 40 sections, 7 equations, 8 tables.