Robust Testing for Deep Learning using Human Label Noise

Gordon Lim, Stefan Larson, Kevin Leach

TL;DR

This work tackles the realism gap in evaluating learning with noisy labels (LNL) by showing that human-generated label noise is feature-dependent and more challenging than synthetic noise. It introduces Cluster-Based Noise (CBN), which mimics human noise by flipping labels within CLIP-space subclusters, and Soft Neighbor Label Sampling (SNLS), which maintains appropriate uncertainty when learning from such noise. A CIFAR-10N case study shows that human noise forms meaningful clusters and can be learned without memorization, which informs the design of more robust LNL methods. Experiments indicate that many existing LNL approaches perform worse under CBN than under Polynomial Margin Diminishing (PMD) noise, while SNLS provides consistent gains. This underscores the importance of evaluating robustness against realistic, feature-driven label noise.

Abstract

In deep learning (DL) systems, label noise in training datasets often degrades model performance, as models may learn incorrect patterns from mislabeled data. The area of Learning with Noisy Labels (LNL) has introduced methods to effectively train DL models in the presence of noisily-labeled datasets. Traditionally, these methods are tested using synthetic label noise, where ground truth labels are randomly (and automatically) flipped. However, recent findings highlight that models perform substantially worse under human label noise than synthetic label noise, indicating a need for more realistic test scenarios that reflect noise introduced due to imperfect human labeling. This underscores the need for generating realistic noisy labels that simulate human label noise, enabling rigorous testing of deep neural networks without the need to collect new human-labeled datasets. To address this gap, we present Cluster-Based Noise (CBN), a method for generating feature-dependent noise that simulates human-like label noise. Using insights from our case study of label memorization in the CIFAR-10N dataset, we design CBN to create more realistic tests for evaluating LNL methods. Our experiments demonstrate that current LNL methods perform worse when tested using CBN, highlighting its use as a rigorous approach to testing neural networks. Next, we propose Soft Neighbor Label Sampling (SNLS), a method designed to handle CBN, demonstrating its improvement over existing techniques in tackling this more challenging type of noise.
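The idea of flipping labels inside tight feature-space subclusters can be illustrated with a minimal sketch. This is not the paper's algorithm: the plain k-means routine, the parameter `k`, the random choice of wrong label per subcluster, and the function names `kmeans` and `cluster_based_noise` are all assumptions made for illustration, and generic feature tuples stand in for CLIP embeddings.

```python
import random
from collections import defaultdict


def kmeans(points, k, rng, iters=20):
    """Plain Lloyd's k-means; returns a cluster index for every point."""
    centers = rng.sample(points, k)
    assign = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest center (squared Euclidean distance).
        assign = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Move each center to the mean of its members; empty clusters stay put.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assign


def cluster_based_noise(features, labels, num_classes, noise_rate, k=5, seed=0):
    """Hypothetical CBN-style noise: flip whole within-class subclusters.

    Assumption: each selected subcluster receives one randomly chosen wrong
    label, and subclusters are flipped until the noise budget is used up.
    """
    rng = random.Random(seed)
    noisy = list(labels)

    # Group example indices by their clean class.
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)

    # Split every class into subclusters of similar examples.
    subclusters = []
    for y, idxs in by_class.items():
        pts = [tuple(features[i]) for i in idxs]
        assign = kmeans(pts, min(k, len(pts)), rng)
        groups = defaultdict(list)
        for i, a in zip(idxs, assign):
            groups[a].append(i)
        subclusters.extend((y, g) for g in groups.values())

    # Flip whole subclusters in random order without exceeding the budget.
    budget = int(noise_rate * len(labels))
    rng.shuffle(subclusters)
    for y, group in subclusters:
        if len(group) > budget:
            continue
        wrong = rng.choice([c for c in range(num_classes) if c != y])
        for i in group:
            noisy[i] = wrong
        budget -= len(group)
    return noisy
```

Because entire subclusters share the same wrong label, the resulting noise is feature-dependent: nearby examples agree on the incorrect label, which is the property that distinguishes this kind of noise from uniform random flipping. The realized noise rate is at most `noise_rate`, since a subcluster is skipped when it would overshoot the budget.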

Paper Structure

This paper contains 11 sections, 1 equation, 6 figures, 1 table, 1 algorithm.

Figures (6)

  • Figure 1: Memorization values for human noisy labels and synthetic class-dependent noisy labels from CIFAR-10N
  • Figure 2: Scatter plot of inclusion and exclusion probabilities for human noisy labels and synthetic class-dependent noisy labels from CIFAR-10N. The distribution is visibly denser for human noisy labels in the region where both probabilities exceed 0.6. We term these examples incorrect learned human noisy labels: labels that are challenging for LNL methods because they were learned without memorization despite being incorrect.
  • Figure 3: t-SNE plot of CIFAR-10 images' CLIP embeddings. Annotated points represent incorrect learned human noisy labels. There appear to be subclusters of these labels within their correct class clusters.
  • Figure 4: Top 10 closest images with incorrect learned human noisy labels within the classes airplane (1st row), cat (2nd row), deer (3rd row), ship (4th row), and truck (5th row), identified by pairwise distance in the CLIP feature space. The incorrect human noisy labels are displayed above each image. Bounding box colors correspond to the color coding of the given CIFAR-10 labels in Fig. 3.
  • Figure 5: Comparison of noise functions at the same noise rate, visualized following prog_noise_iclr2021. (a) Clean labels: a Gaussian blob of data labeled by a vertical decision boundary. (b) Uniform: every point has an equal probability of having its label flipped. (c) PMD: points near the decision boundary have a higher probability of having their labels flipped. (d) CBN (ours): labels are flipped within tight clusters of similar points.
  • ...and 1 more figures
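
The region highlighted in Figure 2 can be expressed as a simple filter. The sketch below is illustrative only: the `LabelStats` record and both function names are hypothetical, the inclusion and exclusion probabilities are assumed to be the estimated probabilities that a model predicts the given noisy label when the example is in the training set and when it is held out, and the memorization value is assumed to be their difference.

```python
from dataclasses import dataclass


@dataclass
class LabelStats:
    """Hypothetical per-example record; field names are illustrative."""
    clean_label: int
    noisy_label: int
    p_incl: float  # P(model predicts noisy_label | example in training set)
    p_excl: float  # P(model predicts noisy_label | example held out)


def memorization(s: LabelStats) -> float:
    # Assumed definition: how much including the example raises the
    # probability of predicting its given label.
    return s.p_incl - s.p_excl


def is_incorrect_learned(s: LabelStats, threshold: float = 0.6) -> bool:
    # Wrong label that the model predicts with high probability even when
    # the example is held out, so it is learned rather than memorized.
    return (
        s.noisy_label != s.clean_label
        and s.p_incl > threshold
        and s.p_excl > threshold
    )
```

An example with a wrong label, `p_incl = 0.9`, and `p_excl = 0.8` has a low memorization value and falls into the dense region of Figure 2, whereas a wrong label with `p_incl = 0.9` and `p_excl = 0.1` is memorized and does not.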