Extracting Clean and Balanced Subset for Noisy Long-tailed Classification

Zhuo Li; He Zhao; Zhen Li; Tongliang Liu; Dandan Guo; Xiang Wan

TL;DR

The paper addresses learning when the training set is both long-tailed and corrupted by label noise. It frames pseudo-labeling as distribution matching between sample representations and class prototypes, solved with entropy-regularized optimal transport (OT). The method builds MoCo-based representations, forms class prototypes, and learns a transport plan $T$ whose rows serve as soft pseudo-labels. A minority-focused weight $b_j$, computed from the Effective Number of samples, biases the prototype distribution $Q$ toward tail classes so that the pseudo-labels are more balanced. A simple filter then intersects the pseudo-labels with the observed labels, using EMA-calibrated prototypes, to yield a clean and balanced training subset $\mathcal{X}$ for robust learning. The method is reported to outperform strong baselines on synthetic noisy long-tailed CIFAR and on the real-world datasets WebVision-50 and Red Mini-ImageNet. The Sinkhorn solver keeps the OT step efficient, and the approach is compatible with self-supervised pretraining and semi-supervised extensions.
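
To make the minority-focused weighting concrete, here is a minimal sketch of how Effective Number weights could define the prototype distribution $Q$. It assumes the standard class-balanced form $b_j \propto (1-\beta)/(1-\beta^{n_j})$, where $n_j$ is the number of samples in class $j$; the exact expression used in the paper is not given in this summary, and the function name `effective_number_weights` is hypothetical.

```python
def effective_number_weights(class_counts, beta):
    """Return normalized weights b_j proportional to (1 - beta) / (1 - beta**n_j).

    Assumption: this is the standard Effective Number weighting, not
    necessarily the exact form used in the paper.
    """
    if not 0.0 <= beta < 1.0:
        raise ValueError("beta must lie in [0, 1)")
    if any(n < 1 for n in class_counts):
        raise ValueError("every class needs at least one sample")
    # Inverse of the effective number E_n = (1 - beta**n) / (1 - beta).
    raw = [(1.0 - beta) / (1.0 - beta ** n) for n in class_counts]
    total = sum(raw)
    # Normalize so the weights form a probability vector over prototypes.
    return [w / total for w in raw]


if __name__ == "__main__":
    counts = [5000, 500, 50]  # toy long-tailed class sizes
    for beta in (0.0, 0.98):
        print(beta, effective_number_weights(counts, beta))
```

Under this assumed form, $\beta = 0$ gives a uniform distribution over prototypes, while values of $\beta$ closer to 1 shift more mass toward classes with few samples.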

Abstract

Real-world datasets are usually class-imbalanced and corrupted by label noise. To address the joint issue of long-tailed distribution and label noise, most previous works design a noise detector to distinguish noisy samples from clean ones. Despite their effectiveness, these methods may struggle to handle both problems in a unified way. In this work, we develop a novel pseudo-labeling method that uses class prototypes from the perspective of distribution matching, which can be solved with optimal transport (OT). By setting a manually specified probability measure and using the learned transport plan to pseudo-label the training samples, the proposed method reduces the side effects of noisy and long-tailed data simultaneously. We then introduce a simple yet effective filter criterion that combines the observed labels and the pseudo-labels to obtain a more balanced and less noisy subset for robust model training. Extensive experiments demonstrate that our method can extract this class-balanced subset with clean labels, which brings effective performance gains for long-tailed classification with label noise.
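
The pipeline described above can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the authors' implementation: the cost is taken to be cosine distance between features and prototypes, the sample marginal is uniform, pseudo-labels are the row-wise argmax of the transport plan, and the filter keeps samples whose pseudo-label equals the observed label. The function names `sinkhorn_plan` and `pseudo_label_and_filter` and the values of `eps` and `n_iters` are hypothetical.

```python
import numpy as np


def sinkhorn_plan(cost, p, q, eps=0.1, n_iters=200):
    """Entropy-regularized OT via Sinkhorn iterations.

    Returns a plan T whose row sums approximate p and column sums match q.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    K = np.exp(-np.asarray(cost, dtype=float) / eps)  # Gibbs kernel
    u = np.ones_like(p)
    v = np.ones_like(q)
    for _ in range(n_iters):
        u = p / (K @ v)    # rescale rows toward marginal p
        v = q / (K.T @ u)  # rescale columns toward marginal q
    return u[:, None] * K * v[None, :]


def pseudo_label_and_filter(features, prototypes, observed_labels, q, eps=0.1):
    """Pseudo-label samples from the plan and keep label-consistent ones."""
    # Assumption: cosine distance as the transport cost.
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    c = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    cost = 1.0 - f @ c.T
    # Assumption: uniform probability mass on the samples.
    p = np.full(len(f), 1.0 / len(f))
    plan = sinkhorn_plan(cost, p, q, eps=eps)
    pseudo = plan.argmax(axis=1)  # hard pseudo-label per sample
    keep = pseudo == np.asarray(observed_labels)  # assumed agreement filter
    return pseudo, keep, plan


if __name__ == "__main__":
    feats = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
    protos = np.array([[1.0, 0.0], [0.0, 1.0]])
    observed = np.array([0, 1, 1, 1])  # second label is deliberately noisy
    pseudo, keep, _ = pseudo_label_and_filter(feats, protos, observed,
                                              q=np.array([0.5, 0.5]))
    print(pseudo, keep)
```

In this sketch the prototype marginal `q` is the manually specified probability measure: passing a distribution that puts more mass on tail classes pushes the plan to assign more samples to them, while the agreement filter discards samples whose observed label contradicts the pseudo-label.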

Paper Structure (35 sections, 9 equations, 12 figures, 13 tables, 1 algorithm)

Introduction
Related Works
Preliminaries
Problem formulation
Optimal Transport
Our Proposed Method
Pseudo-label samples based on OT measurement.
Filtering the clean subset based on Pseudo-label
Implementation details
Experiments
Experiments on simulated CIFAR-10/100
Experiments on real-world datasets
Analysis
Conclusion
Broader Impact
...and 20 more sections

Figures (12)

Figure 1: Overview of our proposed method. We first view the sample features and class prototypes as two distributions and minimize the OT distance between them. We then pseudo-label samples based on the learned transport plan and apply a filter criterion to extract a class-balanced and clean subset, which is used to train the encoder and classifier.
Figure 2: Influence of $\beta$ on performance with three partitions.
Figure 3: Variations of imbalance factor and noise ratio with our method, where we consider using label filter or not.
Figure 4: Number of samples per class in CIFAR-10 with imbalance factor 100 and noise ratio 0.5 under synthetic Joint Noise (a), Symmetric Noise (b), and Asymmetric Noise (c).
Figure 5: Visualization of the transport plan (with or without the label filter) and the ground-truth labels. The top row shows $\beta=0$ and the bottom row shows $\beta=0.98$. The vertical axis gives the real labels of the samples, and the horizontal axis gives the index of the class prototypes.
...and 7 more figures
