
Extracting Clean and Balanced Subset for Noisy Long-tailed Classification

Zhuo Li, He Zhao, Zhen Li, Tongliang Liu, Dandan Guo, Xiang Wan

TL;DR

The paper addresses learning under simultaneous long-tailed class distributions and label noise by framing pseudo-labeling as distribution matching between sample representations and class prototypes, solved with entropy-regularized optimal transport (OT). It builds MoCo-based representations, forms class prototypes, and derives a transport plan $T$ whose rows serve as soft pseudo-labels. A minority-focused weight $b_j$, computed via the Effective Number, biases the prototype distribution $Q$ to mitigate imbalance and guide more balanced pseudo-labels. A simple intersection-based filter between pseudo-labels and observed labels, together with EMA-calibrated prototypes, yields a clean, balanced training subset $\mathcal{X}$ for robust learning. The method is reported to outperform strong baselines on synthetic CIFAR benchmarks and on real-world datasets such as WebVision-50 and Red-Mini-Imagenet. Solving the OT problem with Sinkhorn iterations keeps the approach efficient, and the method is compatible with self-supervised pretraining and semi-supervised extensions.
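The pseudo-labeling step summarized above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names, the cosine-distance cost, the uniform sample marginal `p`, the normalization of $b_j$ as the inverse Effective Number $(1-\beta)/(1-\beta^{n_j})$, and the toy data are all assumptions made for the example.

```python
import numpy as np

def effective_number_weights(class_counts, beta):
    """Hypothetical helper: weights b_j proportional to (1 - beta) / (1 - beta**n_j).

    Assumes 0 <= beta < 1 and every class count >= 1.
    beta = 0 gives uniform weights; beta close to 1 favors minority classes.
    """
    counts = np.asarray(class_counts, dtype=float)
    effective_num = (1.0 - beta ** counts) / (1.0 - beta)
    weights = 1.0 / effective_num
    return weights / weights.sum()

def sinkhorn_plan(cost, p, q, eps=0.1, n_iters=200):
    """Entropy-regularized OT via Sinkhorn scaling; returns a plan of shape (N, K)."""
    kernel = np.exp(-cost / eps)
    u = np.ones_like(p)
    for _ in range(n_iters):
        v = q / (kernel.T @ u)   # rescale to match the prototype marginal q
        u = p / (kernel @ v)     # rescale to match the sample marginal p
    return u[:, None] * kernel * v[None, :]

# Toy long-tailed data: three classes, unit-norm features near one-hot prototypes.
rng = np.random.default_rng(0)
counts = [60, 30, 10]
prototypes = np.eye(3)
true_labels = np.repeat(np.arange(3), counts)
features = prototypes[true_labels] + 0.1 * rng.normal(size=(100, 3))
features /= np.linalg.norm(features, axis=1, keepdims=True)

cost = 1.0 - features @ prototypes.T            # cosine distance (assumed cost)
p = np.full(len(features), 1.0 / len(features)) # uniform sample marginal (assumed)
q = effective_number_weights(counts, beta=0.98) # minority-focused prototype marginal

T = sinkhorn_plan(cost, p, q)
pseudo_labels = T.argmax(axis=1)                # hard pseudo-label from each row of T
```

Because the prototype marginal `q` places more mass on tail classes than the empirical class frequencies do, the plan is pushed to assign more samples to minority prototypes than a nearest-prototype rule would.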

Abstract

Real-world datasets are usually class-imbalanced and corrupted by label noise. To address the joint issue of long-tailed distribution and label noise, most previous works design a noise detector to distinguish noisy samples from clean ones. Despite their effectiveness, they may be limited in handling the joint issue in a unified way. In this work, we develop a novel pseudo-labeling method using class prototypes from the perspective of distribution matching, which can be solved with optimal transport (OT). By setting a manually specified probability measure and using a learned transport plan to pseudo-label the training samples, the proposed method can reduce the side effects of noisy and long-tailed data simultaneously. We then introduce a simple yet effective filter criterion that combines the observed labels and pseudo-labels to obtain a more balanced and less noisy subset for robust model training. Extensive experiments demonstrate that our method can extract this class-balanced subset with clean labels, which brings effective performance gains for long-tailed classification with label noise.
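One plausible reading of the filter criterion is to keep only the samples whose observed label agrees with the pseudo-label taken from the transport plan. The sketch below illustrates that reading; the function name `select_clean_subset`, the use of a row-wise argmax as the pseudo-label, and the toy plan are assumptions for illustration, and the paper's exact criterion may differ.

```python
import numpy as np

def select_clean_subset(observed_labels, transport_plan):
    """Hypothetical filter: keep samples whose observed label matches the pseudo-label.

    transport_plan has shape (N, K); the pseudo-label of sample i is assumed
    to be the index of the largest entry in row i.
    """
    observed = np.asarray(observed_labels)
    pseudo = np.asarray(transport_plan).argmax(axis=1)
    keep = np.flatnonzero(observed == pseudo)
    return keep, pseudo

# Toy plan for four samples and three classes (values are made up).
plan = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.6, 0.3, 0.1],
    [0.2, 0.2, 0.6],
])
observed = np.array([0, 1, 2, 2])  # sample 2 carries a label that disagrees with its row

keep, pseudo = select_clean_subset(observed, plan)
subset_counts = np.bincount(observed[keep], minlength=plan.shape[1])
```

Here the third sample is dropped because its observed label and pseudo-label disagree, and `subset_counts` gives the per-class size of the retained subset, which is the quantity one would inspect to check balance.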

Paper Structure (35 sections, 9 equations, 12 figures, 13 tables, 1 algorithm)

Figures (12)

  • Figure 1: Overview of our proposed method. We first view the sample features and class prototypes as two distributions and minimize the OT distance between them. We then pseudo-label samples based on the learned transport plan and apply a filter criterion to extract a class-balanced and clean subset, which is used to train the encoder and classifier.
  • Figure 2: Influence of $\beta$ on performance with three partitions.
  • Figure 3: Variations of imbalance factor and noise ratio with our method, with and without the label filter.
  • Figure 4: Number of samples per class in CIFAR-10 with imbalance factor 100 and noise ratio 0.5 under synthetic Joint Noise (a), Symmetric Noise (b), and Asymmetric Noise (c).
  • Figure 5: Visualization of the transport plan (with or without the label filter) and the ground-truth labels. The top panels show $\beta=0$ and the bottom panels show $\beta=0.98$. The vertical axis represents the true labels of the samples, and the horizontal axis represents the index of the class prototypes.
  • ...and 7 more figures