Aggregating Soft Labels from Crowd Annotations Improves Uncertainty Estimation Under Distribution Shift

Dustin Wright; Isabelle Augenstein

TL;DR

The paper addresses uncertainty estimation under distribution shift when learning from crowd-sourced labels. It compares eight soft-labeling methods across four language and vision tasks in the out-of-domain setting and shows that averaging the soft labels produced by these methods yields more reliable uncertainty estimates in most settings while keeping raw performance consistent. No single method dominates across all tasks, but averaging soft labels improves out-of-domain uncertainty estimation, most clearly for highly subjective labels and moderate amounts of training data. The work provides practical guidance and code for making better use of crowd annotations in deployment-focused settings.

Abstract

Selecting an effective training signal for machine learning tasks is difficult: expert annotations are expensive, and crowd-sourced annotations may not be reliable. Recent work has demonstrated that learning from a distribution over labels acquired from crowd annotations can be effective both for performance and uncertainty estimation. However, this has mainly been studied using a limited set of soft-labeling methods in an in-domain setting. Additionally, no one method has been shown to consistently perform well across tasks, making it difficult to know a priori which to choose. To fill these gaps, this paper provides the first large-scale empirical study on learning from crowd labels in the out-of-domain setting, systematically analyzing 8 soft-labeling methods on 4 language and vision tasks. Additionally, we propose to aggregate soft-labels via a simple average in order to achieve consistent performance across tasks. We demonstrate that this yields classifiers with improved predictive uncertainty estimation in most settings while maintaining consistent raw performance compared to learning from individual soft-labeling methods or taking a majority vote of the annotations. We additionally highlight that in regimes with abundant or minimal training data, the selection of soft labeling method is less important, while for highly subjective labels and moderate amounts of training data, aggregation yields significant improvements in uncertainty estimation over individual methods. Code can be found at https://github.com/copenlu/aggregating-crowd-annotations-ood.
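The averaging step described in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes that standard normalization divides per-class vote counts by their total and that softmax normalization applies a softmax to the same counts, and it averages only those two candidate distributions. The function names and the example vote counts are hypothetical.

```python
import math


def standard_normalization(counts):
    """Assumed definition: divide each class's vote count by the total votes."""
    total = sum(counts)
    return [c / total for c in counts]


def softmax_normalization(counts):
    """Assumed definition: softmax over the raw vote counts."""
    peak = max(counts)  # subtract the max for numerical stability
    exps = [math.exp(c - peak) for c in counts]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_soft_labels(distributions):
    """Element-wise mean of several label distributions for one item."""
    n = len(distributions)
    return [sum(column) / n for column in zip(*distributions)]


# Hypothetical item: 5 annotators, 3 classes, votes split 3 / 2 / 0.
votes = [3, 2, 0]
candidates = [standard_normalization(votes), softmax_normalization(votes)]
soft_label = aggregate_soft_labels(candidates)
```

Because each candidate sums to one, their mean is also a valid probability distribution and can be used directly as a soft training target.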

Paper Structure (20 sections, 16 equations, 16 figures, 4 tables)

Learning from Crowd-Sourced Labels
Learning from Soft Labels
Standard Normalization
Softmax Normalization
Worker Agreement with Aggregate (Wawa)
ZeroBasedSkill (ZBS)
Dawid & Skene (DS)
Generative model of Labels, Abilities, and Difficulties (GLAD)
MACE
Recognizing Textual Entailment (RTE)
Part-of-Speech Tagging (POS)
Toxicity Detection
Image Classification
Recognizing Textual Entailment (RTE)
Part-of-Speech Tagging (POS)
...and 5 more sections

Figures (16)

Figure 1: Significance testing for the RTE task. We apply the Bonferroni correction across the number of independent variables (N = 7; a more conservative estimate across the total tests N = 56 can be found in the supplemental information). Green indicates the method in the row is significantly better than the method in the column. Red indicates the method in the row is significantly worse than the method in the column. Grey indicates no statistically significant difference.
Figure 2: Significance testing for the POS task. We apply the Bonferroni correction across the number of independent variables (N = 7; a more conservative estimate across the total tests N = 56 can be found in the supplemental information). Green indicates the method in the row is significantly better than the method in the column. Red indicates the method in the row is significantly worse than the method in the column. Grey indicates no statistically significant difference.
Figure 3: Significance testing for the Toxicity task. We apply the Bonferroni correction across the number of independent variables (N = 7; a more conservative estimate across the total tests N = 56 can be found in the supplemental information). Green indicates the method in the row is significantly better than the method in the column. Red indicates the method in the row is significantly worse than the method in the column. Grey indicates no statistically significant difference.
Figure 4: Significance testing for the Image Cls. task. We apply the Bonferroni correction across the number of independent variables (N = 7; a more conservative estimate across the total tests N = 56 can be found in the supplemental information). Green indicates the method in the row is significantly better than the method in the column. Red indicates the method in the row is significantly worse than the method in the column. Grey indicates no statistically significant difference.
Figure 5: Comparison of the average CLL and F1 score on the RTE task using different combinations of distributions for aggregation. Points are the average performance across all combinations of a given number of distributions and error bars are 95% confidence intervals.
...and 11 more figures
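The Bonferroni correction mentioned in the captions of Figures 1 to 4 divides the significance level by the number of comparisons. The sketch below shows how the two values of N from the captions change the per-test threshold. The significance level of 0.05 is an assumption made for illustration; the captions do not state it. The function name is hypothetical.

```python
def bonferroni_threshold(alpha, num_tests):
    """Per-test significance threshold under the Bonferroni correction."""
    if num_tests < 1:
        raise ValueError("num_tests must be at least 1")
    return alpha / num_tests


ALPHA = 0.05  # assumed significance level, not given in the captions

per_variable = bonferroni_threshold(ALPHA, 7)   # N = 7 independent variables
all_tests = bonferroni_threshold(ALPHA, 56)     # N = 56 total tests
```

The N = 56 threshold is eight times stricter than the N = 7 one, which is why the captions call it the more conservative estimate.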
