Confidence Calibration under Ambiguous Ground Truth

Linwei Tao, Haoyang Luo, Minjing Dong, Chang Xu

Abstract

Confidence calibration assumes a unique ground-truth label per input, yet this assumption fails wherever annotators genuinely disagree. Post-hoc calibrators fitted on majority-voted labels, the standard single-label targets used in practice, can appear well-calibrated under conventional evaluation yet remain substantially miscalibrated against the underlying annotator distribution. We show that this failure is structural: under simplifying assumptions, Temperature Scaling is biased toward temperatures that underestimate annotator uncertainty, with true-label miscalibration increasing monotonically with annotation entropy. To address this, we develop a family of ambiguity-aware post-hoc calibrators that optimise proper scoring rules against the full label distribution and require no model retraining. Our methods span progressively weaker annotation requirements: Dirichlet-Soft leverages the full annotator distribution and achieves the best overall calibration quality across settings; Monte Carlo Temperature Scaling with a single annotation per example (MCTS S=1) matches full-distribution calibration across all benchmarks, demonstrating that pre-aggregated label distributions are unnecessary; and Label-Smooth Temperature Scaling (LS-TS) operates with voted labels alone by constructing data-driven pseudo-soft targets from the model's own confidence. Experiments on four benchmarks with real multi-annotator distributions (CIFAR-10H, ChaosNLI) and clinically-informed synthetic annotations (ISIC 2019, DermaMNIST) show that Dirichlet-Soft reduces true-label ECE by 55-87% relative to Temperature Scaling, while LS-TS reduces ECE by 9-77% without any annotator data.
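The contrast between fitting a temperature on voted labels and fitting it on the annotator distribution can be illustrated with a small sketch. This is not the authors' implementation: the logits, the grid of candidate temperatures, and the helper names (`softmax`, `cross_entropy`, `fit_temperature`) are assumptions made for illustration, and the annotator distribution $[0, 0.70, 0.30]$ is borrowed from the toy cluster described for Figure 2.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target, probs, eps=1e-12):
    """Cross-entropy of predicted probs against a one-hot or soft target."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, probs))

def fit_temperature(logits_list, targets, grid):
    """Grid search for the temperature minimising mean cross-entropy."""
    def loss(t):
        total = sum(cross_entropy(y, softmax(z, t))
                    for z, y in zip(logits_list, targets))
        return total / len(logits_list)
    return min(grid, key=loss)

# Hypothetical ambiguous cluster: every example has the same logits,
# the same voted label (class 1), and annotator distribution pi.
logits_list = [[-2.0, 3.0, 1.0]] * 10
pi = [0.0, 0.70, 0.30]      # assumed annotator distribution
voted = [0.0, 1.0, 0.0]     # majority-vote one-hot target
grid = [0.25 + 0.05 * i for i in range(96)]  # candidate temperatures 0.25 .. 5.0

t_voted = fit_temperature(logits_list, [voted] * 10, grid)
t_soft = fit_temperature(logits_list, [pi] * 10, grid)
conf_voted = max(softmax(logits_list[0], t_voted))
conf_soft = max(softmax(logits_list[0], t_soft))
```

On this cluster the voted-label fit is pushed to the smallest temperature on the grid and the resulting confidence approaches 1, whereas the soft-target fit selects a larger temperature and a confidence much closer to the annotator probability of 0.70. A single temperature cannot match the distribution exactly here, so the soft-target confidence is close to, but not equal to, 0.70.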

Paper Structure (43 sections, 3 theorems, 8 equations, 33 figures, 7 tables)

Key Result

Proposition 1

Consider a calibration set containing an ambiguous cluster in which the voted label $y^*$ equals the majority class for every example, but the true annotator probability of $y^*$ is $\pi_{y^*}(x)=q < 1$. Under the assumption that the model is already reasonably accurate on this cluster ($\operatorna

Figures (33)

  • Figure 1: Left: a CIFAR-10H image with dispersed human votes (cat/dog/bird), illustrating perceptual ambiguity in low-resolution vision. Middle: a ChaosNLI premise--hypothesis pair with split entailment/neutral judgments, illustrating semantic ambiguity. Right: an ISIC 2019 melanoma image paired with the clinician-informed label distribution used in our medical experiments, showing clinically plausible MEL/NV confusion. CIFAR-10H and ChaosNLI distributions are empirical human label distributions; the ISIC panel uses the dermatologist confusion model described in Appendix \ref{app:isic-confusion}.
  • Figure 2: Motivating example: all standard calibration methods fail. (a) Toy dataset generation: only the middle Gaussian cluster is ambiguous, with $\pi=[0,0.70,0.30]$; orange points are drawn as label 1 (70%) and red points as label 2 (30%), while all receive the same voted label 1. (b) Summary of the toy results: all three voted-label calibrators (TS, Platt, Histogram Binning) lower $\mathrm{ECE}_{\text{voted}}$ but increase $\mathrm{ECE}_{\text{true}}$, so voted-label evaluation masks the failure. (c) Stratified $\mathrm{ECE}_{\text{true}}$ for TS, Platt, and Histogram Binning: ambiguous examples are those from the middle Gaussian cluster (where annotators disagree); clear examples are those from the two unambiguous clusters (class 0 and class 2). The residual error is concentrated in ambiguous examples for all three methods.
  • Figure 3: Empirical validation of Proposition \ref{prop:entropy-gap}. Test examples are grouped into equal-frequency bins by normalised annotation entropy $H(x)/\log K$. Each point shows the mean pointwise true-label calibration error $|\hat{p}_{\hat{c}}(x)-\pi_{\hat{c}}(x)|$ for Temperature Scaling (error bars: $\pm 1$ s.e.). Across all three dataset--architecture combinations, TS error increases monotonically with annotation entropy, as predicted by Proposition \ref{prop:entropy-gap}. ChaosNLI operates in a higher-entropy regime and exhibits uniformly higher error ($19$--$25\%$).
  • Figure 4: Reliability diagrams for four representative methods: CIFAR-10H ResNet-50 (ECE$_\text{true}$). Red shading indicates overconfidence; green indicates underconfidence. Uncal and TS remain overconfident; soft-label temperature methods (MCTS/SLTS) substantially correct this; Dirichlet-Soft reduces the residual gap further. Reliability diagrams for all remaining methods are provided in Appendix \ref{app:reliability}.
  • Figure 5: Reliability diagrams for four representative methods: ChaosNLI RoBERTa-Large (ECE$_\text{true}$). Same method selection as Figure \ref{fig:reliability-main}. Red shading indicates overconfidence; green indicates underconfidence. NLI models are severely overconfident before calibration; ambiguity-aware methods substantially correct this.
  • ...and 28 more figures
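
The quantities named in the Figure 3 caption, the normalised annotation entropy $H(x)/\log K$ and the pointwise true-label calibration error $|\hat{p}_{\hat{c}}(x)-\pi_{\hat{c}}(x)|$, can be computed as in the following sketch. The function names, the equal-frequency binning details, and the four example predictions are assumptions made for illustration; they are not taken from the paper's code or data.

```python
import math

def normalised_entropy(pi):
    """H(x) / log K for an annotator distribution pi over K classes."""
    h = sum(p * math.log(1.0 / p) for p in pi if p > 0)
    return h / math.log(len(pi))

def pointwise_error(p_hat, pi):
    """|p_hat_c(x) - pi_c(x)| at the predicted class c = argmax p_hat."""
    c = max(range(len(p_hat)), key=lambda i: p_hat[i])
    return abs(p_hat[c] - pi[c])

def binned_error(preds, pis, n_bins):
    """Mean (entropy, error) per equal-frequency bin of normalised entropy."""
    rows = sorted((normalised_entropy(pi), pointwise_error(p, pi))
                  for p, pi in zip(preds, pis))
    size = math.ceil(len(rows) / n_bins)
    bins = [rows[i:i + size] for i in range(0, len(rows), size)]
    return [(sum(h for h, _ in b) / len(b), sum(e for _, e in b) / len(b))
            for b in bins]

# Hypothetical data: an overconfident model predicting 0.9 for class 0
# on four examples whose annotator distributions grow more ambiguous.
preds = [[0.9, 0.05, 0.05]] * 4
pis = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.7, 0.3, 0.0], [0.5, 0.3, 0.2]]
result = binned_error(preds, pis, n_bins=2)
```

In this made-up example the low-entropy bin has a mean error of about 0.05 and the high-entropy bin about 0.30, which mirrors the qualitative trend the caption reports without reproducing any of the paper's numbers.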

Theorems & Definitions (8)

  • Definition 1: True-Label Calibration
  • Proposition 1: Direction of TS Bias
  • proof : Proof sketch
  • Proposition 2: True-Label Miscalibration and Annotation Entropy
  • proof : Proof sketch
  • Proposition 3: Correctness of distributional target
  • Remark 1: Independence assumption
  • Remark 2: Class-conditional limitation