How many labelers do you have? A closer look at gold-standard labels

Chen Cheng, Hilal Asi, John Duchi

TL;DR

This work questions the reliance on gold-standard aggregated labels in supervised learning by introducing a stylized model with multiple noisy labels per example. It contrasts full-label empirical risk minimization with majority-vote aggregation, showing that, when the labeling process is well-aligned with the model, using all labels yields calibrated predictors and faster convergence, typically scaling with $m$ and $n$ (e.g., $O(1/\sqrt{nm})$ in simple setups). The paper also characterizes fundamental limits of learning from aggregated labels, demonstrates robustness of majority-vote methods under misspecification, and develops semi-parametric approaches that combine learning the label-generating process with classifier fitting to recover near-optimal $1/m$ scaling. Empirical results on BlueBirds, CIFAR-10H, and semisynthetic CIFAR-10 datasets corroborate the theory, showing improved calibration and accuracy with full-label information while highlighting the robustness of aggregation under uncertainty. Overall, the study lays mathematical foundations for the value of non-aggregated labeling information in dataset construction and learning pipelines, with practical implications for crowdsourcing and semi-supervised data workflows.

Abstract

The construction of most supervised learning datasets revolves around collecting multiple labels for each instance, then aggregating the labels to form a type of "gold-standard". We question the wisdom of this pipeline by developing a (stylized) theoretical model of this process and analyzing its statistical consequences, showing how access to non-aggregated label information can make training well-calibrated models more feasible than it is with gold-standard labels. The entire story, however, is subtle, and the contrasts between aggregated and fuller label information depend on the particulars of the problem, where estimators that use aggregated information exhibit robust but slower rates of convergence, while estimators that can effectively leverage all labels converge more quickly if they have fidelity to (or can learn) the true labeling process. The theory makes several predictions for real-world datasets, including when non-aggregate labels should improve learning performance, which we test to corroborate the validity of our predictions.
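The contrast between majority-vote aggregation and fitting on all labels can be illustrated with a small simulation. The sketch below is not the authors' code: it assumes a well-specified binary logistic model with Gaussian features, and the names `fit_logistic` and `compare`, the sample sizes, and the choice of Newton's method are all illustrative assumptions. Fitting on all $m$ labels is implemented by using the per-example fraction of positive labels as a soft target, which gives the same maximizer as the full-label log-likelihood.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def fit_logistic(X, targets, steps=25, ridge=1e-8):
    """Newton's method for logistic regression with targets in [0, 1].

    Soft targets are allowed: the cross-entropy against the mean of m
    binary labels equals the average log-loss over those m labels.
    """
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(steps):
        p = sigmoid(X @ theta)
        grad = X.T @ (p - targets) / n
        # Hessian of the average log-loss, with a tiny ridge for stability.
        hess = (X * (p * (1.0 - p))[:, None]).T @ X / n + ridge * np.eye(d)
        theta = theta - np.linalg.solve(hess, grad)
    return theta


def compare(n=4000, d=5, m=5, seed=0):
    """Simulate m noisy labels per example and fit both estimators.

    Assumed setup (illustrative): X ~ N(0, I_d), a unit-norm theta_star,
    and labels drawn i.i.d. from the logistic model. m is odd to avoid ties.
    """
    rng = np.random.default_rng(seed)
    theta_star = np.ones(d) / np.sqrt(d)
    X = rng.standard_normal((n, d))
    p = sigmoid(X @ theta_star)
    labels = rng.random((n, m)) < p[:, None]  # shape (n, m), boolean

    # Full-label fit: fraction of positive labels as a soft target.
    theta_full = fit_logistic(X, labels.mean(axis=1))
    # Majority-vote fit: a single aggregated hard label per example.
    theta_mv = fit_logistic(X, (labels.sum(axis=1) > m / 2).astype(float))
    return theta_star, theta_full, theta_mv


if __name__ == "__main__":
    theta_star, theta_full, theta_mv = compare()
    for name, est in [("full labels", theta_full), ("majority vote", theta_mv)]:
        cosine = est @ theta_star / (np.linalg.norm(est) * np.linalg.norm(theta_star))
        print(f"{name}: cosine={cosine:.4f}, norm={np.linalg.norm(est):.3f}")
```

In this toy setting both fits should point in roughly the right direction, but the majority-vote fit tends to have an inflated norm, because aggregation makes the labels look more confident than the underlying labeling probabilities. That is one concrete sense in which aggregated labels can hurt calibration while leaving classification accuracy largely intact.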

Paper Structure (83 sections, 37 theorems, 283 equations, 3 figures)


Key Result

Corollary 1

Let $X \sim \mathsf{N}(0, I_d)$ and $t^\star = \|{\theta^\star}\|_2$. The maximum likelihood estimator $\widehat{\theta}^{\textup{lr}}_{n,m}$ is consistent, with $\widehat{\theta}^{\textup{lr}}_{n,m} \stackrel{p}{\rightarrow} \theta^\star$, and for $\mathsf{P}_{u^\star}^\perp = I_d - u^\star {u^\sta

Figures (3)

  • Figure 1: Experiments on BlueBirds dataset. (a) Classification error. (b) Calibration error $|\mathop{\rm logit}(\widetilde{p}) - \mathop{\rm logit}(p)|$ with ResNet features reduced via PCA to dimension $d=25$. Error bars show 2 standard error confidence bands over $T = 100$ trials.
  • Figure 2: Experiments on CIFAR-10H dataset. (a) Classification error. (b) Calibration error $|\mathop{\rm logit}(\widetilde{p}) - \mathop{\rm logit}(p)|$ with ResNet features reduced via PCA to dimension $d=40$. Error bars show 2 standard error confidence bands over $T = 100$ trials.
  • Figure 3: Experiments on the semisynthetic CIFAR-10 dataset. The median labeler accuracy is the median $p$ of $p_i = \max_y \textup{softmax}(\alpha f(x_i))_y$ over the training data. Results are averaged over $20$ trials, and vertical axes give the misclassification rate relative to (synthetic) ground-truth labels on a held-out test set. Legend keys correspond to maximum likelihood $\widehat{\theta}^{\textup{mle}}_{n,m}$ (MLE), majority vote $\widehat{\theta}^{\textup{mv}}_{n,m}$ (MV), hard-labeled Dawid-Skene (DS) and GLAD crowdsourced estimators $\widehat{\theta}^{\textup{DS}}_{n,m}$ and $\widehat{\theta}^{\textup{GLAD}}_{n,m}$, and soft-labeled Dawid-Skene (DS prob) and GLAD (GLAD prob) estimators $\widehat{\theta}^{\textup{DS-prob}}_{n,m}$ and $\widehat{\theta}^{\textup{GLAD-prob}}_{n,m}$. (a) Comparison of methods using hard labels. (b) Comparison with crowdsourced estimated soft labels. (c) Number of labelers $m$ versus test error for fixed median accuracy $p = 0.105$ (noisy labelers). (d) Number of labelers $m$ versus test error for fixed median accuracy $p = 0.4$. Both (c) and (d) report 95% error bars over the trials.
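
The two quantities named in the captions above, the calibration error $|\mathop{\rm logit}(\widetilde{p}) - \mathop{\rm logit}(p)|$ and the median labeler accuracy built from $p_i = \max_y \textup{softmax}(\alpha f(x_i))_y$, can be computed as in the following sketch. The function names, the clipping constant `eps`, and the array layout (one row of scores $f(x_i)$ per example) are assumptions made for illustration; the captions do not specify how the per-example calibration errors are aggregated, so the sketch returns them unaggregated.

```python
import numpy as np


def logit(p, eps=1e-12):
    """Log-odds of p, clipped away from 0 and 1 to avoid infinities."""
    p = np.clip(p, eps, 1.0 - eps)
    return np.log(p) - np.log1p(-p)


def calibration_error(p_hat, p_true):
    """Per-example |logit(p_hat) - logit(p_true)|, as in the figure captions."""
    return np.abs(logit(p_hat) - logit(p_true))


def softmax(z):
    """Row-wise softmax with the usual max-subtraction for stability."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def median_labeler_accuracy(scores, alpha):
    """Median over examples of p_i = max_y softmax(alpha * f(x_i))_y.

    `scores` has shape (n_examples, n_classes) and holds f(x_i) row by row;
    `alpha` is the temperature-like scale from the Figure 3 caption.
    """
    probs = softmax(alpha * scores)
    return float(np.median(probs.max(axis=-1)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    scores = rng.standard_normal((1000, 10))  # stand-in for f(x_i), 10 classes
    for alpha in (0.0, 1.0, 5.0):
        print(alpha, median_labeler_accuracy(scores, alpha))
```

With `alpha = 0` every class is equally likely, so the median accuracy equals one over the number of classes; increasing `alpha` sharpens the softmax and pushes the median toward 1, which is how a single scale parameter can sweep from very noisy to nearly perfect simulated labelers.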

Theorems & Definitions (48)

  • Corollary 1
  • Corollary 2
  • Proposition 1
  • Proposition 2
  • Lemma 3.1
  • Theorem 1
  • Corollary 3: The well-specified case
  • Theorem 2
  • Corollary 4
  • Proposition 3
  • ...and 38 more