Extracting Clean and Balanced Subset for Noisy Long-tailed Classification
Zhuo Li, He Zhao, Zhen Li, Tongliang Liu, Dandan Guo, Xiang Wan
TL;DR
The paper tackles learning under simultaneous long-tailed class distributions and label noise by framing pseudo-labeling as distribution matching between sample representations and class prototypes, solved with entropy-regularized optimal transport (OT). It builds MoCo-based representations, forms class prototypes, and derives a transport plan $T$ whose rows provide soft pseudo-labels. A minority-focused weight $b_j$, computed via the Effective Number of samples, biases the prototype distribution $Q$ toward tail classes, which steers the pseudo-labels toward a more balanced assignment. A simple filter that keeps samples whose pseudo-labels agree with their observed labels, combined with EMA-calibrated prototypes, yields a clean, balanced training subset $\mathcal{X}$ for robust learning. The method is reported to outperform strong baselines on synthetic CIFAR benchmarks and on real-world datasets such as WebVision-50 and Red Mini-ImageNet. The OT problem is solved efficiently with Sinkhorn iterations, and the approach is compatible with self-supervised pretraining and semi-supervised extensions.
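The OT pseudo-labeling step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the cosine-distance cost, the uniform sample measure `p`, the values of `beta` and `eps`, and all function names are assumptions made for the example.

```python
import numpy as np

def effective_number_weights(class_counts, beta=0.999):
    """Minority-focused class weights via the Effective Number of samples.

    E_n = (1 - beta^n) / (1 - beta); weights are proportional to 1 / E_n,
    so rarer classes receive more mass. beta is an assumed hyperparameter.
    """
    counts = np.asarray(class_counts, dtype=float)
    eff_num = (1.0 - np.power(beta, counts)) / (1.0 - beta)
    w = 1.0 / eff_num
    return w / w.sum()

def sinkhorn_plan(cost, p, q, eps=0.05, n_iters=200):
    """Entropy-regularized OT plan between marginals p (rows) and q (columns)."""
    K = np.exp(-cost / eps)          # Gibbs kernel
    u = np.ones_like(p)
    v = np.ones_like(q)
    for _ in range(n_iters):
        u = p / (K @ v)              # rescale rows toward p
        v = q / (K.T @ u)            # rescale columns toward q
    return u[:, None] * K * v[None, :]

def ot_pseudo_labels(features, prototypes, class_counts, beta=0.999, eps=0.05):
    """Return the transport plan T and hard pseudo-labels (row-wise argmax)."""
    # Assumed cost: cosine distance between L2-normalized features and prototypes.
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    c = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    cost = 1.0 - f @ c.T
    n = features.shape[0]
    p = np.full(n, 1.0 / n)                          # assumed uniform sample measure
    q = effective_number_weights(class_counts, beta)  # prototype measure Q
    T = sinkhorn_plan(cost, p, q, eps)
    soft = T / T.sum(axis=1, keepdims=True)           # each row as a soft pseudo-label
    return T, soft.argmax(axis=1)
```

Because the column marginal is fixed to `q`, the plan is forced to send a prescribed share of mass to every class, which is what keeps head classes from absorbing all pseudo-labels.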
Abstract
Real-world datasets are usually class-imbalanced and corrupted by label noise. To address the joint problem of long-tailed distribution and label noise, most previous works design a noise detector to distinguish noisy samples from clean ones. Despite their effectiveness, these approaches may be limited in handling the two problems in a unified way. In this work, we develop a novel pseudo-labeling method that uses class prototypes from the perspective of distribution matching, which can be solved with optimal transport (OT). By setting a manually specified probability measure and using a learned transport plan to pseudo-label the training samples, the proposed method can reduce the side effects of noisy and long-tailed data simultaneously. We then introduce a simple yet effective filter criterion that combines the observed labels and pseudo-labels to obtain a more balanced and less noisy subset for robust model training. Extensive experiments demonstrate that our method can extract this class-balanced subset with clean labels, which brings effective performance gains for long-tailed classification with label noise.
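The filter criterion can be illustrated with a short sketch. It assumes the criterion keeps a sample when its pseudo-label agrees with its observed label, and that prototypes are refreshed by an exponential moving average over the kept samples; the function names and the `momentum` value are hypothetical and not taken from the paper.

```python
import numpy as np

def select_clean_subset(observed_labels, pseudo_labels):
    """Indices of samples whose observed label matches the pseudo-label."""
    observed = np.asarray(observed_labels)
    pseudo = np.asarray(pseudo_labels)
    return np.flatnonzero(observed == pseudo)

def update_prototypes(prototypes, features, labels, keep_idx, momentum=0.9):
    """EMA update of each class prototype from the kept samples of that class.

    Classes with no kept samples leave their prototype unchanged.
    """
    labels = np.asarray(labels)
    new = prototypes.copy()
    for k in range(prototypes.shape[0]):
        members = keep_idx[labels[keep_idx] == k]
        if members.size:
            class_mean = features[members].mean(axis=0)
            new[k] = momentum * prototypes[k] + (1.0 - momentum) * class_mean
    return new
```

For example, with observed labels `[0, 1, 2, 1, 0]` and pseudo-labels `[0, 2, 2, 1, 1]`, only the samples at indices 0, 2, and 3 are kept.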
