Soft-Label Training Preserves Epistemic Uncertainty

Agamdeep Singh, Ashish Tiwari, Hosein Hasanbeig, Priyanshu Gupta

TL;DR

The paper argues that, for genuinely ambiguous data, the annotation distribution formed by multiple human judgments should be treated as ground truth rather than collapsed to a single label. It demonstrates that soft-label training, which targets the full distribution, preserves epistemic uncertainty and improves alignment with human perception without sacrificing accuracy. Across NLP and vision tasks (ChaosNLI, POPQUORN, CIFAR-10H), soft-label models achieve lower KL divergence ($D_{KL}$) from human annotations and show stronger correlation between model uncertainty and data uncertainty (61% improvement on average). The findings suggest practical benefits for calibration, robustness, and trustworthy AI, particularly when data ambiguity is substantial and annotations are plentiful enough to estimate distributions.

Abstract

Many machine learning tasks involve inherent subjectivity, where annotators naturally provide varied labels. Standard practice collapses these label distributions into single labels, aggregating diverse human judgments into point estimates. We argue that this approach is epistemically misaligned for ambiguous data--the annotation distribution itself should be regarded as the ground truth. Training on collapsed single labels forces models to express false confidence on fundamentally ambiguous cases, creating a misalignment between model certainty and the diversity of human perception. We demonstrate empirically that soft-label training, which treats annotation distributions as ground truth, preserves epistemic uncertainty. Across both vision and NLP tasks, soft-label training achieves 32% lower KL divergence from human annotations and 61% stronger correlation between model and annotation entropy, while matching the accuracy of hard-label training. Our work repositions annotation distributions from noisy signals to be aggregated away, to faithful representations of epistemic uncertainty that models should learn to reproduce.
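
To make the contrast between the two training targets concrete, the following is a minimal sketch, not the authors' implementation. It builds a soft label and a collapsed hard label from the same annotator votes, then scores two model predictions against the human distribution. The vote counts, the logits, and all function names are hypothetical. The KL direction used here, KL(human || model), is an assumption, since the abstract does not state which direction is reported.

```python
import math

def soft_label(counts):
    """Normalize per-class annotator vote counts into a probability distribution."""
    total = sum(counts)
    return [c / total for c in counts]

def hard_label(counts):
    """Collapse vote counts to a one-hot majority label (ties go to the lowest index)."""
    winner = counts.index(max(counts))
    return [1.0 if i == winner else 0.0 for i in range(len(counts))]

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(target, pred, eps=1e-12):
    """Cross-entropy H(target, pred); accepts one-hot and soft targets alike."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, pred) if t > 0)

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q), with p the human distribution and q the model prediction (assumed direction)."""
    return sum(pi * math.log(pi / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical example: 10 annotators voting over 3 classes.
votes = [6, 3, 1]
human = soft_label(votes)      # distribution used as the soft-label target
majority = hard_label(votes)   # one-hot target used by hard-label training

# Two hypothetical model outputs that predict the same majority class.
confident = softmax([6.0, 0.0, 0.0])   # near one-hot prediction
spread = softmax([1.2, 0.5, -0.6])     # prediction that keeps some mass on minority classes

print("soft-target loss, confident:", cross_entropy(human, confident))
print("soft-target loss, spread:   ", cross_entropy(human, spread))
print("hard-target loss, confident:", cross_entropy(majority, confident))
print("KL(human || confident):", kl_divergence(human, confident))
print("KL(human || spread):   ", kl_divergence(human, spread))
```

Under the one-hot target, the near one-hot prediction is rewarded, while under the soft target the prediction that spreads probability mass in line with the votes scores better. Both predictions pick the same majority class, so accuracy alone would not distinguish them.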

Paper Structure

This paper contains 16 sections, 1 equation, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Annotator entropy varies across samples, from near-consensus (a) to substantial disagreement (b). This variation signals that uncertainty is sample-dependent, not uniform noise, suggesting different data points carry fundamentally different levels of epistemic ambiguity. Images from CIFAR-10H [cifar10h].
  • Figure 2: Both models predict the same majority class, yet their uncertainty profiles diverge sharply (KL=0.01 vs 0.08). This demonstrates that soft-label training captures the full annotation distribution rather than collapsing to artificial certainty—producing predictions that acknowledge ambiguity when it exists.
  • Figure 3: Entropy distributions across datasets. Entropy is normalized by the number of classes.
  • Figure 4: Validation loss for soft-label (blue) and hard-label (orange) models. Loss magnitudes differ due to different target representations (distributions vs. one-hot) but training dynamics are revealing: soft-label models maintain stable or improving validation performance longer, while hard-label models plateau or degrade earlier.
  • Figure 5: On high-entropy samples, soft-label predictions (orange) mirror the spread of human annotations (blue), while hard-label predictions (green) collapse to single peaks despite genuine disagreement. This shows that training method directly shapes whether models express or suppress uncertainty—affecting trustworthiness on fundamentally ambiguous inputs.
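
The entropy-based quantities mentioned in the captions and abstract can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the paper's evaluation code. It assumes that "normalized for number of classes" means dividing Shannon entropy by log K, so values lie in [0, 1], and it uses Pearson correlation, although the text does not say which correlation measure is reported. All distributions and function names below are hypothetical.

```python
import math

def normalized_entropy(p):
    """Shannon entropy of p divided by log(K), so datasets with different K are comparable."""
    k = len(p)
    h = -sum(pi * math.log(pi) for pi in p if pi > 0)
    return h / math.log(k)

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical per-sample distributions over 3 classes.
human_dists = [
    [0.90, 0.05, 0.05],   # near-consensus
    [0.60, 0.30, 0.10],   # moderate disagreement
    [0.40, 0.35, 0.25],   # substantial disagreement
]
model_dists = [
    [0.85, 0.10, 0.05],
    [0.55, 0.30, 0.15],
    [0.45, 0.30, 0.25],
]

human_entropy = [normalized_entropy(p) for p in human_dists]
model_entropy = [normalized_entropy(q) for q in model_dists]

# Higher correlation means the model is uncertain on the samples where annotators disagree.
print("annotation entropy:", human_entropy)
print("model entropy:     ", model_entropy)
print("correlation:       ", pearson(human_entropy, model_entropy))
```

A model that collapses every prediction to a single peak would have near-zero entropy on all samples, and its entropy would carry little information about which inputs annotators found ambiguous.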