Learning with Noisy Labels Revisited: A Study Using Real-World Human Annotations

Jiaheng Wei, Zhaowei Zhu, Hao Cheng, Tongliang Liu, Gang Niu, Yang Liu

TL;DR

The paper addresses the gap between synthetic and real-world label noise by introducing CIFAR-10N and CIFAR-100N, real-world, human-annotated noisy-label benchmarks. It demonstrates that human noise is predominantly instance-dependent, with imbalanced and feature-correlated transition patterns, and may differ substantially from class-dependent synthetic models. The authors benchmark a broad set of robust methods, revealing notable performance gaps between human noise and synthetic noise and highlighting memorization dynamics that favor learning from clean signals but also cause overfitting to wrong labels. Overall, CIFAR-N provides accessible, ground-truth datasets and benchmarks to reevaluate learning with noisy labels and drive methodological advances toward real-world robustness.

Abstract

Existing research on learning with noisy labels mainly focuses on synthetic label noise. Synthetic noise, though it has clean structures that greatly enable statistical analyses, often fails to model real-world noise patterns. The recent literature has seen several efforts to offer real-world noisy datasets, yet these efforts suffer from two caveats: (1) the lack of ground-truth verification makes it hard to theoretically study the properties and treatment of real-world label noise; (2) the datasets are often large-scale, which may result in unfair comparisons of robust methods within reasonable and accessible computation power. To better understand real-world label noise, it is crucial to build controllable and moderate-sized real-world noisy datasets with both ground-truth and noisy labels. This work presents two new benchmark datasets, CIFAR-10N and CIFAR-100N, which equip the training sets of CIFAR-10 and CIFAR-100 with human-annotated real-world noisy labels that we collected from Amazon Mechanical Turk. We quantitatively and qualitatively show that real-world noisy labels follow an instance-dependent pattern rather than the classically assumed and adopted ones (e.g., class-dependent label noise). We then initiate an effort to benchmark a subset of the existing solutions using CIFAR-10N and CIFAR-100N. We further study the memorization of correct and wrong predictions, which further illustrates the difference between human noise and class-dependent synthetic noise. We show that real-world noise patterns indeed impose new and outstanding challenges compared to synthetic label noise. These observations require us to rethink the treatment of noisy labels, and we hope the availability of these two datasets will facilitate the development and evaluation of future solutions for learning with noisy labels. Datasets and leaderboards are available at http://noisylabels.com.
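
Because the datasets pair every noisy label with its ground-truth label, a class-level noise transition matrix can be estimated by simple counting. The following is a minimal sketch of that idea, not the authors' code: the function name `empirical_transition_matrix` and the toy label arrays are hypothetical, and the real CIFAR-N label files are not loaded here.

```python
import numpy as np

def empirical_transition_matrix(clean, noisy, num_classes):
    """Row-normalized counts: T[i, j] estimates P(noisy = j | clean = i)."""
    clean = np.asarray(clean)
    noisy = np.asarray(noisy)
    counts = np.zeros((num_classes, num_classes), dtype=np.float64)
    # Accumulate one count per (clean, noisy) pair, including repeated pairs.
    np.add.at(counts, (clean, noisy), 1.0)
    row_sums = counts.sum(axis=1, keepdims=True)
    # Classes with no clean examples keep an all-zero row.
    return np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)

# Toy labels (hypothetical), standing in for paired clean and human labels.
clean_labels = np.array([0, 0, 0, 0, 1, 1, 1, 1, 2, 2])
noisy_labels = np.array([0, 0, 1, 0, 1, 1, 1, 2, 2, 2])

T = empirical_transition_matrix(clean_labels, noisy_labels, num_classes=3)
noise_rate = float(np.mean(clean_labels != noisy_labels))
print(T)
print(noise_rate)
```

A single matrix like `T` is exactly the class-dependent summary; it averages over all images of a class and therefore cannot, by itself, reveal the instance-dependent structure the abstract describes.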

Paper Structure

This paper contains 54 sections, 8 equations, 13 figures, 9 tables.

Figures (13)

  • Figure 1: Categorical distribution of noisy labels on CIFAR-100N (the 6 most imbalanced super-classes): the red line in each subplot marks 500, the number of clean examples in each fine class.
  • Figure 2: Top 3 wrongly annotated fine labels in selected fine classes. For "pine tree", "shrew", and "streetcar", the dominant class is the wrong class. The corresponding numbers of correct annotations are highlighted with red lines.
  • Figure 3: Transition matrix of CIFAR-10N noisy labels (color bar is log-norm transformed).
  • Figure 4: Exemplary CIFAR-100 training images with multiple labels. The text below each picture denotes the CIFAR-100 clean label (first row) and the human annotated noisy label (second row).
  • Figure 5: Illustration of noise transitions of human-level label noise and the synthetic version. We divide the representations of images from the same true class into $5$ clusters by $k$-means. The representations come from the output before the final fully-connected layer of ResNet34. The model is trained on clean CIFAR-10. Negative cosine similarity measures the distance between features.
  • ...and 8 more figures
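
The analysis described in the Figure 5 caption (cluster the features of one true class, then compare noise transitions across clusters) can be sketched as follows. This is an illustration under stated assumptions rather than the paper's procedure: the helper names, the farthest-first initialization, the small spherical k-means variant, and the random toy features (standing in for ResNet34 representations) are all hypothetical.

```python
import numpy as np

def spherical_kmeans(features, k, num_iters=20):
    """k-means on unit-normalized features, using negative cosine similarity as distance."""
    x = features / np.linalg.norm(features, axis=1, keepdims=True)
    # Deterministic farthest-first initialization (an assumption of this sketch).
    centers = [x[0]]
    for _ in range(1, k):
        best_sim = np.max(x @ np.stack(centers).T, axis=1)
        centers.append(x[np.argmin(best_sim)])
    centers = np.stack(centers)
    for _ in range(num_iters):
        # Highest cosine similarity is the smallest negative-cosine distance.
        assign = np.argmax(x @ centers.T, axis=1)
        for c in range(k):
            members = x[assign == c]
            if len(members) > 0:
                mean = members.mean(axis=0)
                norm = np.linalg.norm(mean)
                if norm > 0:
                    centers[c] = mean / norm
    return np.argmax(x @ centers.T, axis=1)

def per_cluster_noisy_distribution(assign, noisy, k, num_classes):
    """Row c is the empirical distribution of noisy labels inside cluster c."""
    dist = np.zeros((k, num_classes))
    for c in range(k):
        labels = noisy[assign == c]
        if len(labels) > 0:
            dist[c] = np.bincount(labels, minlength=num_classes) / len(labels)
    return dist

# Toy data: 100 samples that all share true class 0, in two feature groups.
rng = np.random.default_rng(0)
group_a = np.array([1.0, 0.0]) + 0.05 * rng.standard_normal((50, 2))
group_b = np.array([0.0, 1.0]) + 0.05 * rng.standard_normal((50, 2))
features = np.concatenate([group_a, group_b])
# Group A is labeled correctly; half of group B is mislabeled as class 1.
noisy = np.concatenate([np.zeros(50, dtype=int),
                        np.zeros(25, dtype=int), np.ones(25, dtype=int)])

assign = spherical_kmeans(features, k=2)
dist = per_cluster_noisy_distribution(assign, noisy, k=2, num_classes=2)
print(dist)
```

Under purely class-dependent noise, every row of `dist` would be roughly the same, since the flip probabilities would depend only on the true class. Rows that differ across clusters, as in this toy example, are the feature-correlated signature the caption attributes to human label noise.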

Theorems & Definitions (2)

  • Definition 1: $M$-NN noise clusterability
  • Definition 2: Memorized feature