Identifying Hard Noise in Long-Tailed Sample Distribution

Xuanyu Yi, Kaihua Tang, Xian-Sheng Hua, Joo-Hwee Lim, Hanwang Zhang

TL;DR

This work tackles Noisy Long-Tailed Classification (NLT), where long-tailed priors render many previously detectable noises hard to identify. It introduces Hard-to-Easy (H2E), an iterative, two-stage framework that first learns a noise identifier invariant to class and context shifts via multi-environment Invariant Risk Minimization (IRM) and then trains a robust classifier with a long-tailed loss. The authors formalize the problem, propose an IRM-based noise converter with environment-specific sampling and augmentation, and validate on newly constructed benchmarks (ImageNet-NLT, Animal10-NLT, Food101-NLT) where H2E consistently outperforms state-of-the-art de-noising and long-tailed methods. The results show that learning distribution-invariant representations effectively turns hard noises into easy ones, enabling stable performance under realistic noisy and imbalanced conditions with practical implications for large-scale data cleaning and robust learning.

Abstract

Conventional de-noising methods rely on the assumption that all samples are independent and identically distributed, so the resultant classifier, though disturbed by noise, can still easily identify the noises as outliers of the training distribution. However, this assumption is unrealistic in large-scale data, which is inevitably long-tailed. Such imbalanced training data makes a classifier less discriminative for the tail classes, whose previously "easy" noises are now turned into "hard" ones: they look almost as much like outliers as the clean tail samples do. We introduce this new challenge as Noisy Long-Tailed Classification (NLT). Not surprisingly, we find that most de-noising methods fail to identify the hard noises, resulting in a significant performance drop on the three proposed NLT benchmarks: ImageNet-NLT, Animal10-NLT, and Food101-NLT. To this end, we design an iterative noisy learning framework called Hard-to-Easy (H2E). Our bootstrapping philosophy is to first learn a classifier as a noise identifier invariant to class and context distributional changes, reducing "hard" noises to "easy" ones, whose removal further improves the invariance. Experimental results show that our H2E outperforms state-of-the-art de-noising methods and their ablations in long-tailed settings while maintaining stable performance in the conventional balanced settings. Datasets and code are available at https://github.com/yxymessi/H2E-Framework
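
The invariance idea above can be made concrete with the IRMv1 penalty of Invariant Risk Minimization, which the TL;DR names as the basis of the noise identifier. The following is a minimal pure-Python sketch, not the authors' implementation: the function names, the representation of an environment as a `(logits_batch, labels)` pair, and the `penalty_weight` parameter are all assumptions made for illustration. It uses the fact that, for cross-entropy on logits scaled by a dummy factor `w`, the gradient with respect to `w` at `w = 1` has a closed form.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def env_risk_and_penalty(logits_batch, labels):
    """Return (cross-entropy risk, IRMv1 penalty) for one environment.

    The penalty is the squared gradient of the risk with respect to a
    dummy scale w applied to the logits, evaluated at w = 1. For
    cross-entropy this gradient is mean_i(sum_k p_ik * z_ik - z_i[y_i]).
    """
    n = len(labels)
    risk = 0.0
    grad = 0.0
    for z, y in zip(logits_batch, labels):
        p = softmax(z)
        risk += -math.log(p[y])
        grad += sum(pk * zk for pk, zk in zip(p, z)) - z[y]
    risk /= n
    grad /= n
    return risk, grad ** 2

def irm_objective(environments, penalty_weight=1.0):
    """Sum of per-environment risks plus weighted invariance penalties.

    `environments` is a hypothetical list of (logits_batch, labels)
    pairs, one per environment.
    """
    total = 0.0
    for logits_batch, labels in environments:
        risk, penalty = env_risk_and_penalty(logits_batch, labels)
        total += risk + penalty_weight * penalty
    return total
```

A penalty near zero in every environment indicates that the same classifier is simultaneously close to optimal in all of them, which is the sense of "invariant" used here. How the environments themselves are constructed (the class and context sampling described in the paper) is outside the scope of this sketch.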

Paper Structure

This paper contains 20 sections, 5 equations, 10 figures, 7 tables, and 1 algorithm.

Figures (10)

  • Figure 1: (a) Large-scale datasets are both long-tailed and noisy. For instance, the head category "cat" may contain noisy samples such as "leopard" and "cartoon tiger", while the tail category "hedgehog" may contain noise like "porcupine" and "spiny horse". (b) Identifying noise by classifier confidence (or training loss) is no longer applicable to tail classes for most de-noising algorithms.
  • Figure 2: Comparison of CE (cross-entropy) [shore1981properties] and Logit-Adjustment [menon2020long] on CIFAR-100 under different noise ratios.
  • Figure 3: Multiple environments with diverse class and context distributions are built; an IRM optimization [arjovsky2019invariant] is then applied to obtain an identifier that is invariant across environments.
  • Figure 4: An example of the iterative hard-to-easy transformation on Red ImageNet-NLT, showing that H2E gradually detects harder noises and improves overall robustness.
  • Figure 5: (a) Evaluation (precision) of noise identification capability on Blue ImageNet-NLT. The proposed H2E significantly improves precision on the few-shot (tail) categories by better identifying hard noises. (b) Evaluation (top-1 accuracy, %) on Red ImageNet-NLT. We compare test accuracy on many-, medium-, and few-shot classes across methods.
  • ...and 5 more figures
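
For readers unfamiliar with the Logit-Adjustment baseline compared against cross-entropy in Figure 2, a minimal pure-Python sketch of the logit-adjusted softmax cross-entropy is given below. It is an illustration under stated assumptions rather than the code used in the paper: the helper names and the temperature `tau` are hypothetical, and class priors are simply estimated from per-class sample counts.

```python
import math

def class_priors(class_counts):
    """Estimate class priors from per-class training sample counts."""
    total = sum(class_counts)
    return [c / total for c in class_counts]

def logit_adjusted_loss(logits, label, priors, tau=1.0):
    """Cross-entropy computed on logits shifted by tau * log(prior).

    With tau = 0 this reduces to plain softmax cross-entropy. With
    tau > 0, rare classes receive a large negative shift during
    training, so the model must produce a larger raw logit to fit them.
    """
    adjusted = [z + tau * math.log(p) for z, p in zip(logits, priors)]
    m = max(adjusted)
    # Stable log-sum-exp of the adjusted logits.
    log_norm = m + math.log(sum(math.exp(a - m) for a in adjusted))
    return log_norm - adjusted[label]

# Example: a 90/10 head/tail split with uninformative logits.
priors = class_priors([900, 100])
head_loss = logit_adjusted_loss([0.0, 0.0], 0, priors)
tail_loss = logit_adjusted_loss([0.0, 0.0], 1, priors)
```

In this example the tail-class loss is larger than the head-class loss even though the raw logits are identical, which is how the adjustment pushes the classifier toward larger margins on tail classes.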