Opt-ICL at LeWiDi-2025: Maximizing In-Context Signal from Rater Examples via Meta-Learning

Taylor Sorensen, Yejin Choi

TL;DR

Opt-ICL combines in-context learning with two-stage meta-learning to model annotator disagreement across LeWiDi tasks. Through Spectrum Tuning, dataset-specific training, and in-context inference that leverages rater demonstrations, the system was the overall winner on both competition tasks. The ablations show that in-context rater examples are crucial, that the larger datasets benefit from dataset-specific tuning, that Spectrum Tuning helps on one of the competition datasets, and that model scale aids performance but cannot replace targeted training. The work offers practical methods for modeling human variation in NLP and informs evaluation and calibration under disagreement.

Abstract

Many natural language processing (NLP) tasks involve subjectivity, ambiguity, or legitimate disagreement between annotators. In this paper, we outline our system for modeling human variation. Our system leverages the in-context learning abilities of large language models (LLMs), along with a two-step meta-learning training procedure: 1) post-training on many datasets requiring in-context learning and 2) specializing the model via in-context meta-learning to the particular data distribution of interest. We also evaluate the performance of our system submission to the Learning With Disagreements (LeWiDi) competition, where it was the overall winner on both tasks. Additionally, we perform an ablation study to measure the importance of each system component. We find that including rater examples in-context is crucial for our system's performance, that dataset-specific fine-tuning is helpful on the larger datasets, that post-training on other in-context datasets is helpful on one of the competition datasets, and that performance improves with model scale.
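To make the idea of placing rater examples in-context concrete, the following is a minimal sketch of how a prompt for one rater might be assembled. The `RaterExample` type, the `build_rater_prompt` function, and the `Input:`/`Label:` template are all hypothetical and chosen only for illustration; the abstract does not specify the prompt format the system actually uses.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class RaterExample:
    """One item previously labeled by the rater (hypothetical structure)."""
    text: str
    label: str


def build_rater_prompt(
    examples: Sequence[RaterExample],
    target_text: str,
    task_description: str = "",
) -> str:
    """Concatenate a rater's past annotations, then the unlabeled target item.

    The template is an assumption for illustration, not the paper's format.
    """
    parts = []
    if task_description:
        parts.append(task_description.strip())
    # Each demonstration shows how this particular rater labeled an item.
    for ex in examples:
        parts.append(f"Input: {ex.text}\nLabel: {ex.label}")
    # The target item is left open so the model completes the label.
    parts.append(f"Input: {target_text}\nLabel:")
    return "\n\n".join(parts)


demo = [
    RaterExample("Nice weather for a flood.", "sarcastic"),
    RaterExample("The train arrives at noon.", "not sarcastic"),
]
prompt = build_rater_prompt(demo, "Great, another meeting.", "Rate each reply.")
```

Under this sketch, the model's next-token distribution over label strings after the final `Label:` would serve as the prediction for that rater on the target item.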

Paper Structure

This paper contains 24 sections, 1 figure, 4 tables.

Figures (1)

  • Figure 1: Ablation study results. Perspectivist Task: For MP/VEN, error rate is reported, and for CSC/Par, absolute distance is reported (lower is better for both). Soft Task: For MP/VEN, Manhattan distance is reported, and for CSC/Par, Wasserstein distance is reported (lower is better for both). Error bars indicate 95% confidence intervals, computed as $\pm$ 1.96 times the standard error of the mean of instance-level scores. Our system performance is shown as a solid line, and the best competing team performance is shown as a dashed line.
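The soft-task distances and the error bars described in the Figure 1 caption can be sketched as follows. This is an illustrative sketch under stated assumptions, not the competition's scoring code: the function names are hypothetical, the Wasserstein distance assumes ordered labels spaced one unit apart, the standard error uses the sample standard deviation (n - 1 in the denominator), and any normalization applied by the official scorer is not reproduced.

```python
import math
from typing import Sequence, Tuple


def manhattan_distance(p: Sequence[float], q: Sequence[float]) -> float:
    """Sum of absolute differences between two label distributions."""
    return sum(abs(a - b) for a, b in zip(p, q))


def wasserstein_distance_ordinal(p: Sequence[float], q: Sequence[float]) -> float:
    """1-D Wasserstein distance between distributions over ordered labels.

    Assumes adjacent labels are one unit apart, so the distance equals the
    sum of absolute differences between the cumulative distributions.
    """
    total = 0.0
    cumulative_gap = 0.0
    for a, b in zip(p, q):
        cumulative_gap += a - b
        total += abs(cumulative_gap)
    return total


def mean_with_ci95(scores: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, half-width), where half-width = 1.96 * standard error."""
    n = len(scores)
    mean = sum(scores) / n
    # Sample variance with n - 1 in the denominator (an assumption here).
    variance = sum((s - mean) ** 2 for s in scores) / (n - 1)
    sem = math.sqrt(variance / n)
    return mean, 1.96 * sem


# Instance-level scores would be averaged and reported as mean ± half-width.
mean, half_width = mean_with_ci95([1.0, 2.0, 3.0])
```

In this sketch each instance contributes one distance between the predicted and the observed label distribution, and the plotted interval is the mean of those instance-level scores plus or minus the returned half-width.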