Learning from Ambiguous Data with Hard Labels

Zeke Xie, Zheng He, Nan Lu, Lichen Bai, Bao Li, Shuo Yang, Mingming Sun, Ping Li

TL;DR

This work addresses learning from intrinsically ambiguous data when only quantized hard labels are available, a setting in which standard training can produce overconfident models that generalize poorly. It introduces Quantized Label Learning (QLL) and the Class-wise Positive-Unlabeled (CPU) risk, which is based on a multi-class extension of the nnPU risk with class-wise priors, to leverage soft-label information indirectly and prevent overfitting to biased labels. A mixing-based Ambiguous Data Benchmark generates synthetic ambiguous datasets (CIFAR-10Q, CIFAR-100Q, AFHQ-Q) for evaluation. Empirical results show significant generalization improvements over strong baselines, robustness to hyperparameters, and further gains when combined with semi-supervised techniques.

Abstract

Real-world data often contains intrinsic ambiguity that the common single-hard-label annotation paradigm ignores. Standard training on ambiguous data with these hard labels may produce overconfident models and thus lead to poor generalization. In this paper, we propose a novel framework called Quantized Label Learning (QLL) to alleviate this issue. First, we formulate QLL as learning from (very) ambiguous data with hard labels. Ideally, each ambiguous instance should be associated with a ground-truth soft-label distribution describing its probabilistic weight in each class; however, this distribution is usually not accessible. In practice, we can only observe a quantized label for each instance, i.e., a hard label sampled (quantized) from the corresponding ground-truth soft-label distribution, which can be seen as a biased approximation of the ground-truth soft label. Second, we propose a Class-wise Positive-Unlabeled (CPU) risk estimator that allows us to train accurate classifiers from only ambiguous data with quantized labels. Third, to simulate real-world ambiguous datasets with quantized labels, we design a mixing-based ambiguous data generation procedure for empirical evaluation. Experiments demonstrate that our CPU method can significantly improve model generalization performance and outperform the baselines.
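
The mixing-and-quantizing idea described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's generation procedure: the function name `mix_and_quantize`, the Mixup-style convex combination, and the uniform Dirichlet mixing weights are all assumptions made for this example.

```python
import numpy as np

def mix_and_quantize(images, labels, num_classes, m=2, rng=None):
    """Mix m images with random convex weights and sample one hard label.

    images: float array of shape (n, ...); labels: int array of shape (n,).
    Returns (mixed_image, soft_label, quantized_label).
    """
    rng = np.random.default_rng() if rng is None else rng
    # Pick m distinct source images.
    idx = rng.choice(len(images), size=m, replace=False)
    # Assumption: mixing weights drawn from a uniform Dirichlet distribution.
    weights = rng.dirichlet(np.ones(m))
    # Convex combination of the chosen images (Mixup-style mixing).
    mixed = np.tensordot(weights, images[idx], axes=1)
    # Soft label: each source contributes its weight to its own class.
    soft = np.zeros(num_classes)
    np.add.at(soft, labels[idx], weights)
    # Quantized label: a single hard label sampled from the soft label.
    quantized = int(rng.choice(num_classes, p=soft))
    return mixed, soft, quantized
```

Only `mixed` and `quantized` would be visible to a learner in the QLL setting; `soft` is returned here purely so the hidden ground-truth distribution can be inspected.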

Paper Structure (13 sections, 6 equations, 4 figures, 7 tables, 1 algorithm)


Figures (4)

  • Figure 1: Illustration of the CPU risk estimator. For a classification task with $c$ classes, we train a classifier to predict the positive/negative label for each class independently. The CPU risk is obtained by averaging the binary PU risks of $c$ classes.
  • Figure 2: Examples of synthetic AFHQ-Q training data with the corresponding quantized labels, generated by pre-trained StyleMapGAN. (a) Generated by "dog" and "cat". (b) Generated by "dog" and "cat". (c) Generated by "wild" and "dog".
  • Figure 3: Illustration of robustness to different $\pi_{\textnormal{p}}^{(1)}$. The performance of CPU is robust to various class priors.
  • Figure 4: The test curves of the baselines. Left: CIFAR-10Q, Mixup, $m=2$. Right: CIFAR-10Q, PatchMix, $m=4$.
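
The Figure 1 caption describes the CPU risk as the average of $c$ binary PU risks. A rough sketch of that averaging is given below, under several assumptions that are not confirmed by the text above: each per-class risk uses the standard non-negative PU (nnPU) form, the surrogate is a sigmoid loss, samples whose quantized label is $k$ serve as positives for class $k$, and all samples serve as the unlabeled set. The names `sigmoid_loss` and `cpu_risk` are hypothetical.

```python
import numpy as np

def sigmoid_loss(z, t):
    """Sigmoid surrogate loss for score z and target t in {+1, -1}."""
    return 1.0 / (1.0 + np.exp(t * z))

def cpu_risk(scores, quantized_labels, priors):
    """Average of per-class non-negative PU risks (illustrative sketch).

    scores: (n, c) real-valued outputs, one binary scorer per class.
    quantized_labels: (n,) observed hard labels.
    priors: (c,) assumed class-wise positive priors pi_k.
    """
    n, c = scores.shape
    risks = []
    for k in range(c):
        z = scores[:, k]
        pos = quantized_labels == k
        if not pos.any():
            continue  # no labeled positives for this class in the batch
        pi = priors[k]
        risk_p_pos = sigmoid_loss(z[pos], +1).mean()  # positives as positive
        risk_p_neg = sigmoid_loss(z[pos], -1).mean()  # positives as negative
        risk_u_neg = sigmoid_loss(z, -1).mean()       # unlabeled as negative
        # nnPU correction: clamp the estimated negative-class risk at zero.
        risks.append(pi * risk_p_pos + max(0.0, risk_u_neg - pi * risk_p_neg))
    return float(np.mean(risks)) if risks else 0.0
```

Treating each class as its own positive-versus-unlabeled problem means a sample labeled "dog" is never asserted to be a negative for "cat", which is the property that matters when the hidden soft label spreads mass over several classes.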

Theorems & Definitions (1)

  • Definition 1: Quantized Label
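
Only the title of Definition 1 survives here. Based on the abstract's description (a hard label sampled from the ground-truth soft-label distribution), one plausible way to write it is shown below. The symbols $\tilde{y}$, $\mathbf{s}(x)$, and $s_k(x)$ are notation assumed for this sketch and may differ from the paper's.

```latex
\tilde{y} \sim \mathrm{Categorical}\big(\mathbf{s}(x)\big), \qquad
\Pr(\tilde{y} = k \mid x) = s_k(x), \qquad
\sum_{k=1}^{c} s_k(x) = 1 .
```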