How many labelers do you have? A closer look at gold-standard labels
Chen Cheng, Hilal Asi, John Duchi
TL;DR
This work questions the reliance on gold-standard aggregated labels in supervised learning by introducing a stylized model with multiple noisy labels per example. It contrasts full-label empirical risk minimization with majority-vote aggregation, showing that, when the labeling process is well-aligned with the model, using all labels yields calibrated predictors and faster convergence, typically scaling with $m$ and $n$ (e.g., $O(1/\sqrt{nm})$ in simple setups). The paper also characterizes fundamental limits of learning from aggregated labels, demonstrates robustness of majority-vote methods under misspecification, and develops semi-parametric approaches that combine learning the label-generating process with classifier fitting to recover near-optimal $1/m$ scaling. Empirical results on BlueBirds, CIFAR-10H, and semisynthetic CIFAR-10 datasets corroborate the theory, showing improved calibration and accuracy with full-label information while highlighting the robustness of aggregation under uncertainty. Overall, the study lays mathematical foundations for the value of non-aggregated labeling information in dataset construction and learning pipelines, with practical implications for crowdsourcing and semi-supervised data workflows.
Abstract
The construction of most supervised learning datasets revolves around collecting multiple labels for each instance, then aggregating the labels to form a type of "gold-standard". We question the wisdom of this pipeline by developing a (stylized) theoretical model of this process and analyzing its statistical consequences, showing how access to non-aggregated label information can make training well-calibrated models more feasible than it is with gold-standard labels. The entire story, however, is subtle, and the contrasts between aggregated and fuller label information depend on the particulars of the problem, where estimators that use aggregated information exhibit robust but slower rates of convergence, while estimators that can effectively leverage all labels converge more quickly if they have fidelity to (or can learn) the true labeling process. The theory makes several predictions for real-world datasets, including when non-aggregate labels should improve learning performance, which we test to corroborate the validity of our predictions.
