Mitigating Label Noise through Data Ambiguation

Julian Lienen, Eyke Hüllermeier

TL;DR

This work tackles label noise in deep learning by introducing Robust Data Ambiguation (RDA), which represents training targets as credal sets rather than fixed labels. Using the superset learning framework, RDA derives set-valued targets from model predictions and a confidence-based threshold, enabling the optimizer to avoid memorizing mislabeled data. The approach is instantiated with a KL-based loss and a convex projection onto credal sets, controlled by hyperparameters $\beta$ and $\alpha$ to regulate cautiousness and relaxation. Empirical results on CIFAR-10/100 with synthetic noise and real-world noisy datasets (WebVision, Clothing1M, CIFAR-10N) show improved generalization without extra model parameters, demonstrating that data ambiguation can effectively suppress memorization while preserving learning from clean samples.

Abstract

Label noise poses an important challenge in machine learning, especially in deep learning, in which large models with high expressive power dominate the field. Models of that kind are prone to memorizing incorrect labels, thereby harming generalization performance. Many methods have been proposed to address this problem, including robust loss functions and more complex label correction approaches. Robust loss functions are appealing due to their simplicity, but typically lack flexibility, while label correction usually adds substantial complexity to the training setup. In this paper, we suggest to address the shortcomings of both methodologies by "ambiguating" the target information, adding additional, complementary candidate labels in case the learner is not sufficiently convinced of the observed training label. More precisely, we leverage the framework of so-called superset learning to construct set-valued targets based on a confidence threshold, which deliver imprecise yet more reliable beliefs about the ground-truth, effectively helping the learner to suppress the memorization effect. In an extensive empirical evaluation, our method demonstrates favorable learning behavior on synthetic and real-world noise, confirming the effectiveness in detecting and correcting erroneous training labels.
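
The target construction and loss described above can be illustrated with a small NumPy sketch. It is not the authors' implementation; several details are assumptions made for illustration: the function names `ambiguate_target` and `credal_kl_loss` are hypothetical, the possibility vector takes the value 1 for the observed label and for every class predicted with probability at least $\beta$ and the value $\alpha$ elsewhere, the credal set is taken to contain all distributions that place at most $\alpha$ mass on the remaining classes, and predictions are assumed strictly positive (e.g. softmax outputs).

```python
import numpy as np

def ambiguate_target(p_hat, observed_label, beta, alpha):
    """Possibility vector (assumed form): 1 for the observed label and for every
    class predicted with probability >= beta, alpha for all other classes."""
    pi = np.full(p_hat.shape, alpha, dtype=float)
    pi[p_hat >= beta] = 1.0
    pi[observed_label] = 1.0
    return pi

def credal_kl_loss(p_hat, pi, alpha):
    """KL divergence between p_hat and its projection onto the credal set
    {p : mass on not-fully-plausible classes <= alpha} (assumed form)."""
    plausible = pi >= 1.0
    outside_mass = p_hat[~plausible].sum()
    if outside_mass <= alpha:
        return 0.0  # p_hat already lies inside the credal set
    # Projection: rescale so plausible classes carry 1 - alpha and the rest alpha,
    # keeping the proportions of p_hat within each group.
    scale = np.where(plausible,
                     (1.0 - alpha) / p_hat[plausible].sum(),
                     alpha / outside_mass)
    p_proj = scale * p_hat
    nonzero = p_proj > 0  # skip zero entries (alpha = 0) to avoid log(0)
    return float(np.sum(p_proj[nonzero] * np.log(p_proj[nonzero] / p_hat[nonzero])))

# Observed (possibly corrupt) label is class 0; the model favours class 1.
p_hat = np.array([0.1, 0.7, 0.2])
pi = ambiguate_target(p_hat, observed_label=0, beta=0.5, alpha=0.1)
loss = credal_kl_loss(p_hat, pi, alpha=0.1)
```

In this example both the observed label and the confidently predicted class become fully plausible, so the loss only penalizes the excess mass on the third class rather than forcing the model toward the observed label.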

Paper Structure

This paper contains 14 sections, 6 equations, 4 figures, 3 tables, and 1 algorithm.

Figures (4)

  • Figure 1: For ResNet34 models trained with cross-entropy on CIFAR-10 with 25 % of instances corrupted (averaged over five seeds), the left plot shows the fractions of examples that are correctly classified, whose corrupted training labels are memorized, or that are incorrectly classified with a label other than the ground truth or the training label, confirming the result in Liu et al. (2020). The right plot illustrates the predicted probability magnitudes for clean and noisy labels.
  • Figure 2: Learned feature representations of the training instances, observed at the penultimate layer of an MLP comprising an encoder and a classification head, at different stages of training. The data consists of correctly labeled (blue or green, respectively) and incorrectly labeled (red) images of zeros and ones from MNIST. The dashed line depicts the linear classifier.
  • Figure 3: A barycentric visualization of the confidence-thresholded ambiguation for a corrupt training label $y_1$ and a ground truth $y_2$ in the target space $\mathcal{Y} = \{y_1, y_2, y_3\}$: starting from a credal set $\mathcal{Q}$ centered at $p_{y_1}$ (left plot), the prediction $\widehat{p}$ assigns a probability mass greater than $\beta$ to $y_2$. Consequently, full possibility is assigned to $y_2$, leading to the credal set $\mathcal{Q}$ shown on the right.
  • Figure 4: The top plot shows the fraction of mislabeled training instances for which the models predict the ground-truth (blue), the wrong training label (orange) or a different label (green). The middle and bottom plots show the credal set size and validity respectively. All plots are averaged over the five runs on CIFAR-10 with 50 % synthetic symmetric noise.
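
The experimental setting behind Figure 4 can be approximated with the following sketch. The definitions are assumptions, not taken from the paper: symmetric noise is modeled as replacing a fixed fraction of labels with a different class drawn uniformly (one common convention), "credal set size" is read as the number of fully plausible classes per instance, and "validity" as the fraction of instances whose ground-truth label is among them. The helper names `inject_symmetric_noise` and `credal_set_stats` are hypothetical.

```python
import numpy as np

def inject_symmetric_noise(labels, noise_rate, num_classes, rng):
    """Flip a noise_rate fraction of labels to a different, uniformly drawn class
    (assumed convention for symmetric noise)."""
    labels = np.asarray(labels)
    noisy = labels.copy()
    n_flip = int(round(noise_rate * len(labels)))
    flip_idx = rng.choice(len(labels), size=n_flip, replace=False)
    # An offset in [1, num_classes - 1] modulo num_classes never maps to the same class.
    offsets = rng.integers(1, num_classes, size=n_flip)
    noisy[flip_idx] = (labels[flip_idx] + offsets) % num_classes
    return noisy, flip_idx

def credal_set_stats(probs, noisy_labels, true_labels, beta):
    """Mean number of fully plausible classes and fraction of instances whose
    ground truth is fully plausible (assumed readings of size and validity)."""
    rows = np.arange(len(noisy_labels))
    plausible = probs >= beta
    plausible[rows, noisy_labels] = True  # the observed label stays fully plausible
    set_size = plausible.sum(axis=1).mean()
    validity = plausible[rows, true_labels].mean()
    return float(set_size), float(validity)

rng = np.random.default_rng(0)
clean = rng.integers(0, 10, size=1000)
noisy, flipped = inject_symmetric_noise(clean, noise_rate=0.5, num_classes=10, rng=rng)
```

Restricting the statistics to the indices in `flipped` would give the per-mislabeled-instance view that the figure caption describes.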