Learning with Instance-Dependent Noisy Labels by Anchor Hallucination and Hard Sample Label Correction
Po-Hsuan Huang, Chia-Ching Lin, Chih-Fan Hsu, Ming-Ching Chang, Wei-Chao Chen
TL;DR
This work tackles image classification under instance-dependent label noise (IDN) by differentiating easy versus hard samples in addition to clean versus noisy labels. It introduces anchor hallucination to synthesize hard anchors from easy samples, enabling selection and label correction of hard samples, followed by semi-supervised training that leverages both corrected hard samples and easy samples. An iterative training procedure alternates between classifier optimization and hallucinator refinement, using a Gaussian Mixture Model for easy-sample selection, cosine-based anchor matching, and MixMatch-based SSL. Extensive experiments on synthetic IDN benchmarks and real-world datasets (CIFAR-10N/100N, Clothing1M) show consistent improvements over state-of-the-art NLL methods, highlighting the value of hard samples for shaping robust decision boundaries. The approach offers a new perspective on exploiting hard but clean data to improve robustness to IDN in practical settings and suggests avenues for broader application beyond image classification.
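To make the cosine-based anchor matching step concrete, the following is a minimal sketch, not the authors' implementation. It assumes that samples and hallucinated anchors are already embedded as feature vectors, that each anchor carries the class label of the easy sample it was synthesized from, and that a sample counts as a matched hard sample when its best cosine similarity exceeds a fixed threshold. The function name `match_to_anchors`, its parameters, and the threshold value are hypothetical.

```python
import numpy as np

def match_to_anchors(features, anchors, anchor_labels, threshold=0.8):
    """Match sample features to anchors by cosine similarity.

    Illustrative assumptions (not from the paper):
      - features: (n_samples, d) array of sample embeddings
      - anchors: (n_anchors, d) array of hallucinated anchor embeddings
      - anchor_labels: (n_anchors,) class label attached to each anchor
      - threshold: minimum cosine similarity for a sample to be selected
    Returns (selected_mask, corrected_labels, best_similarity).
    """
    features = np.asarray(features, dtype=float)
    anchors = np.asarray(anchors, dtype=float)
    anchor_labels = np.asarray(anchor_labels)

    # L2-normalize rows so that a dot product equals cosine similarity.
    eps = 1e-12
    f = features / (np.linalg.norm(features, axis=1, keepdims=True) + eps)
    a = anchors / (np.linalg.norm(anchors, axis=1, keepdims=True) + eps)

    sim = f @ a.T                      # shape (n_samples, n_anchors)
    best = sim.argmax(axis=1)          # index of the closest anchor
    best_sim = sim[np.arange(len(f)), best]

    selected = best_sim >= threshold   # samples close enough to some anchor
    corrected = anchor_labels[best]    # label proposed by the closest anchor
    return selected, corrected, best_sim


if __name__ == "__main__":
    feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    anchors = [[2.0, 0.0], [0.0, 3.0]]
    mask, labels, sims = match_to_anchors(feats, anchors, [0, 1], threshold=0.9)
    print(mask, labels, sims)
```

In this sketch, only samples where `selected` is true would have their labels replaced by `corrected`; the remaining samples would be treated as unlabeled data in the semi-supervised stage.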
Abstract
Learning from noisy-labeled data is crucial for real-world applications. Traditional Noisy-Label Learning (NLL) methods categorize training data into clean and noisy sets based on the loss distribution of training samples. However, they often neglect that clean samples, especially those with intricate visual patterns, may also yield substantial losses. This oversight is particularly significant in datasets with Instance-Dependent Noise (IDN), where mislabeling probabilities correlate with visual appearance. Our approach explicitly distinguishes between clean vs. noisy and easy vs. hard samples. We identify training samples with small losses, assuming they have simple patterns and correct labels. Utilizing these easy samples, we hallucinate multiple anchors to select hard samples for label correction. Corrected hard samples, along with the easy samples, are used as labeled data in subsequent semi-supervised training. Experiments on synthetic and real-world IDN datasets demonstrate the superior performance of our method over other state-of-the-art NLL methods.
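The small-loss selection of easy samples can be illustrated with a two-component Gaussian Mixture Model fitted to per-sample losses, as the TL;DR indicates. The sketch below is a self-contained NumPy version using a hand-written EM loop. It is an assumption about how such a selection could look, not the paper's code: the function name `select_easy_samples`, the min-max normalization, the initialization, and the 0.5 probability threshold are all illustrative choices.

```python
import numpy as np

def select_easy_samples(losses, prob_threshold=0.5, n_iter=50):
    """Fit a 1-D two-component GMM to per-sample losses with EM and
    flag samples that likely belong to the low-loss component.

    All hyperparameters here are illustrative assumptions.
    Returns (easy_mask, p_easy).
    """
    losses = np.asarray(losses, dtype=float)

    # Min-max normalize losses to [0, 1] for numerical stability.
    lo, hi = losses.min(), losses.max()
    x = (losses - lo) / (hi - lo + 1e-12)

    # Initialize one component at the smallest loss, one at the largest.
    mu = np.array([x.min(), x.max()])
    var = np.full(2, x.var() + 1e-6)
    pi = np.array([0.5, 0.5])

    for _ in range(n_iter):
        # E-step: posterior responsibility of each component per sample.
        log_p = (-0.5 * (x[:, None] - mu) ** 2 / var
                 - 0.5 * np.log(2 * np.pi * var)
                 + np.log(pi))
        log_p -= log_p.max(axis=1, keepdims=True)
        resp = np.exp(log_p)
        resp /= resp.sum(axis=1, keepdims=True)

        # M-step: update mixture weights, means, and variances.
        nk = resp.sum(axis=0) + 1e-12
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-6
        pi = nk / nk.sum()

    # The component with the smaller mean models the small-loss samples.
    small = mu.argmin()
    p_easy = resp[:, small]
    return p_easy >= prob_threshold, p_easy


if __name__ == "__main__":
    demo_losses = [0.05, 0.10, 0.08, 0.12, 0.07, 2.0, 2.2, 1.9, 2.1, 2.3]
    mask, p = select_easy_samples(demo_losses)
    print(mask)
```

Note that under this scheme a high-loss sample is not automatically treated as noisy: it is only excluded from the easy set, leaving room for a later step to decide whether it is a hard but clean sample.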
