Robust Testing for Deep Learning using Human Label Noise
Gordon Lim, Stefan Larson, Kevin Leach
TL;DR
This work addresses the realism gap in evaluating learning with noisy labels (LNL) by showing that human-generated label noise is feature-dependent and more challenging than synthetic noise. It introduces Cluster-Based Noise (CBN), which mimics human noise by flipping labels within CLIP-space subclusters, and Soft Neighbor Label Sampling (SNLS), which maintains appropriate uncertainty when learning from such noise. A CIFAR-10N case study reveals that human noise forms meaningful clusters and can be learned without memorization, informing the design of more robust LNL methods. Experiments indicate that many existing LNL approaches perform worse under CBN than under PMD noise, while SNLS provides consistent gains, underscoring the importance of evaluating robustness against realistic, feature-driven label noise.
Abstract
In deep learning (DL) systems, label noise in training datasets often degrades model performance, as models may learn incorrect patterns from mislabeled data. The area of Learning with Noisy Labels (LNL) has introduced methods for training DL models effectively on noisily labeled datasets. Traditionally, these methods are tested using synthetic label noise, where ground-truth labels are flipped randomly and automatically. However, recent findings show that models perform substantially worse under human label noise than under synthetic label noise, indicating a need for more realistic test scenarios that reflect the noise introduced by imperfect human labeling. Generating realistic noisy labels that simulate human label noise would enable rigorous testing of deep neural networks without collecting new human-labeled datasets. To address this gap, we present Cluster-Based Noise (CBN), a method for generating feature-dependent noise that simulates human label noise. Using insights from our case study of label memorization in the CIFAR-10N dataset, we design CBN to create more realistic tests for evaluating LNL methods. Our experiments demonstrate that current LNL methods perform worse when tested using CBN, highlighting its value as a rigorous approach to testing neural networks. We also propose Soft Neighbor Label Sampling (SNLS), a method designed to handle CBN, and demonstrate that it improves over existing techniques on this more challenging type of noise.
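
To make the contrast between random synthetic noise and cluster-based, feature-dependent noise concrete, the following is a minimal Python sketch and not the authors' implementation. The function names, the plain k-means routine, the `clusters_per_class` parameter, the per-class flipping budget, and the uniformly random choice of target class are all assumptions made for illustration; the random feature matrix is only a stand-in for CLIP embeddings.

```python
import numpy as np


def symmetric_noise(labels, num_classes, noise_rate, rng):
    """Flip a random fraction of labels to a different class, ignoring features."""
    noisy = labels.copy()
    n_flip = int(round(noise_rate * len(labels)))
    idx = rng.choice(len(labels), size=n_flip, replace=False)
    # An offset in [1, num_classes - 1] guarantees the new label differs.
    offsets = rng.integers(1, num_classes, size=n_flip)
    noisy[idx] = (labels[idx] + offsets) % num_classes
    return noisy


def kmeans(x, k, rng, n_iter=20):
    """Plain Lloyd's k-means (hypothetical helper); returns a cluster index per row."""
    centers = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(n_iter):
        dists = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = dists.argmin(axis=1)
        for j in range(k):
            members = x[assign == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
    return assign


def cluster_based_noise(features, labels, num_classes, noise_rate, rng,
                        clusters_per_class=5):
    """Assumed sketch: flip whole feature-space subclusters within each class."""
    noisy = labels.copy()
    for c in range(num_classes):
        idx = np.flatnonzero(labels == c)
        if len(idx) == 0:
            continue
        k = min(clusters_per_class, len(idx))
        assign = kmeans(features[idx], k, rng)
        budget = int(round(noise_rate * len(idx)))
        # Visit subclusters in random order; flip a subcluster only if it fits
        # in the remaining budget, so the per-class noise rate is an upper bound.
        for j in rng.permutation(k):
            members = idx[assign == j]
            if len(members) == 0 or len(members) > budget:
                continue
            # Assumption: the wrong class is drawn uniformly at random.
            target = (c + rng.integers(1, num_classes)) % num_classes
            noisy[members] = target
            budget -= len(members)
    return noisy


rng = np.random.default_rng(0)
features = rng.normal(size=(200, 8))    # stand-in for CLIP embeddings
labels = rng.integers(0, 4, size=200)   # clean labels for 4 classes
sym = symmetric_noise(labels, 4, 0.3, rng)
cbn = cluster_based_noise(features, labels, 4, 0.3, rng)
```

In this sketch, `symmetric_noise` scatters wrong labels across the dataset independently of the features, whereas `cluster_based_noise` assigns one shared wrong label to every sample in a chosen subcluster, so mislabeled samples are close to each other in feature space. That locality is the property the abstract attributes to human label noise.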
