Learning from Ambiguous Data with Hard Labels
Zeke Xie, Zheng He, Nan Lu, Lichen Bai, Bao Li, Shuo Yang, Mingming Sun, Ping Li
TL;DR
This work addresses learning from intrinsically ambiguous data when only quantized hard labels are available, a setting in which standard training can yield overconfident models and poor generalization. It introduces Quantized Label Learning (QLL) and the Class-wise Positive-Unlabeled (CPU) risk, a multi-class extension of the nnPU risk that uses class-wise priors, to leverage soft-label information indirectly and prevent overfitting to biased labels. A mixing-based Ambiguous Data Benchmark generates synthetic ambiguous data (CIFAR-10Q, CIFAR-100Q, AFHQ-Q) to evaluate the approach. Empirical results show significant generalization improvements over strong baselines, including robustness to hyperparameters and gains when combined with semi-supervised techniques, highlighting the method’s practical value for learning under label ambiguity.
Abstract
Real-world data often contains intrinsic ambiguity that the common single-hard-label annotation paradigm ignores. Standard training on ambiguous data with these hard labels may produce overly confident models and thus lead to poor generalization. In this paper, we propose a novel framework called Quantized Label Learning (QLL) to alleviate this issue. First, we formulate QLL as learning from (very) ambiguous data with hard labels. Ideally, each ambiguous instance should be associated with a ground-truth soft-label distribution describing its probabilistic weight in each class; however, this distribution is usually not accessible. In practice, we can only observe a quantized label for each instance, i.e., a hard label sampled (quantized) from the corresponding ground-truth soft-label distribution, which can be seen as a biased approximation of the ground-truth soft label. Second, we propose a Class-wise Positive-Unlabeled (CPU) risk estimator that allows us to train accurate classifiers from only ambiguous data with quantized labels. Third, to simulate real-world ambiguous datasets with quantized labels, we design a mixing-based ambiguous data generation procedure for empirical evaluation. Experiments demonstrate that our CPU method can significantly improve model generalization performance and outperform the baselines.
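The quantized-label setting described above can be illustrated with a minimal sketch. It is not the paper's exact generation procedure: the pairwise convex mixing, the uniform mixing weight, and the helper names `mix_pairs` and `quantize_labels` are assumptions made for illustration only. The sketch mixes pairs of clean instances to obtain a soft label, then samples a single hard label from that soft label, which is all the learner would observe.

```python
import numpy as np


def mix_pairs(x, y, num_classes, rng):
    """Mix each instance with a randomly chosen partner (hypothetical helper).

    Returns the mixed inputs and their soft labels. The mixing weight is
    drawn uniformly from [0, 1), which is an assumption of this sketch.
    """
    n = x.shape[0]
    perm = rng.permutation(n)                 # partner index for each instance
    lam = rng.random(n)                       # per-instance mixing weight
    lam_x = lam.reshape(-1, *([1] * (x.ndim - 1)))  # broadcast over input dims
    x_mix = lam_x * x + (1.0 - lam_x) * x[perm]
    onehot = np.eye(num_classes)
    soft = lam[:, None] * onehot[y] + (1.0 - lam[:, None]) * onehot[y[perm]]
    return x_mix, soft


def quantize_labels(soft_labels, rng):
    """Sample one hard label per instance from its soft-label distribution."""
    k = soft_labels.shape[1]
    return np.array([rng.choice(k, p=p) for p in soft_labels])


rng = np.random.default_rng(0)
x = rng.random((8, 4, 4, 3))                  # toy "images"
y = rng.integers(0, 3, size=8)                # clean labels over 3 classes
x_mix, soft = mix_pairs(x, y, num_classes=3, rng=rng)
hard = quantize_labels(soft, rng)             # the only labels the learner sees
```

A single quantized label discards the class weights in `soft`, which is why training directly on `hard` with a standard cross-entropy loss can push the model toward overconfident predictions on inputs that are genuinely ambiguous.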
