Learning from Noisy Labels for Long-tailed Data via Optimal Transport
Mengting Li, Chuang Zhu
TL;DR
This work tackles learning with noisy labels (LNL) on long-tailed data by proposing OTLNL, a two-stage framework. The first stage uses a dynamic loss-distance cross-selection to filter clean samples robustly by combining label predictions with feature-centroid information. The second stage employs optimal transport (OT) to generate high-quality pseudo-labels for the noisy set, with class centroids replacing individual features in the transport cost to alleviate tail sparsity. The method combines a dynamic, class-specific thresholding strategy, centroid-based filtering, and OT-driven semi-supervised denoising with consistency and contrastive losses, resulting in improved pseudo-label quality and more balanced learning across head and tail classes. Experiments on long-tailed CIFAR-10/100 with synthetic label noise and on WebVision show state-of-the-art performance, and ablations support the contribution of each component. Overall, the paper integrates optimal transport into noisy-label learning under long-tailed distributions, with empirical evidence of improved resilience to both label noise and class imbalance in computer vision tasks.
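The first-stage filtering can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that a sample counts as clean only when both its loss and its distance to the centroid of its given class fall at or below that class's own mean, and the per-class mean is an assumed stand-in for the paper's dynamic threshold. The names `class_centroids` and `select_clean` are hypothetical.

```python
import math
from collections import defaultdict


def class_centroids(features, labels):
    """Mean feature vector per (possibly noisy) label."""
    groups = defaultdict(list)
    for feat, y in zip(features, labels):
        groups[y].append(feat)
    return {
        y: [sum(col) / len(feats) for col in zip(*feats)]
        for y, feats in groups.items()
    }


def select_clean(losses, features, labels):
    """Return sorted indices of samples judged clean.

    Assumption: a sample is kept only if BOTH its loss and its
    distance to its class centroid are at or below the per-class
    mean. The per-class mean stands in for the paper's dynamic,
    class-specific threshold, whose exact form is not given here.
    """
    cents = class_centroids(features, labels)
    dists = [math.dist(f, cents[y]) for f, y in zip(features, labels)]

    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)

    clean = []
    for idx in by_class.values():
        # Thresholds are computed per class, so tail classes are not
        # judged against statistics dominated by head classes.
        loss_thr = sum(losses[i] for i in idx) / len(idx)
        dist_thr = sum(dists[i] for i in idx) / len(idx)
        clean += [
            i for i in idx
            if losses[i] <= loss_thr and dists[i] <= dist_thr
        ]
    return sorted(clean)
```

Requiring agreement between the loss criterion and the distance criterion is one way to read "cross-selection": a tail-class sample with a high loss but a feature close to its centroid, or the reverse, is not automatically trusted.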
Abstract
Noisy labels, which are common in real-world datasets, can significantly impair the training of deep learning models. However, recent adversarial noise-combating methods overlook the long-tailed distribution of real data, which can significantly reduce the effectiveness of denoising strategies. Meanwhile, the mismanagement of noisy labels further compromises the model's ability to handle long-tailed data. To tackle this issue, we propose a novel approach for data characterized by both long-tailed distributions and noisy labels. First, we introduce a loss-distance cross-selection module, which integrates class predictions and feature distributions to filter clean samples, addressing the uncertainties introduced by noisy labels and long-tailed distributions. Subsequently, we employ optimal transport strategies to generate pseudo-labels for the noisy set within a semi-supervised training scheme, enhancing pseudo-label quality while mitigating the effects of sample scarcity caused by the long-tailed distribution. We conduct experiments on both synthetic and real-world datasets, and the comprehensive experimental results demonstrate that our method surpasses current state-of-the-art methods. Our code will be available in the future.
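As a rough illustration of the pseudo-labeling step, the sketch below assigns soft labels to noisy samples by solving an entropy-regularized optimal transport problem with Sinkhorn iterations. It is written under stated assumptions rather than taken from the paper: the cost is the squared Euclidean distance from each sample to each class centroid, the class marginal defaults to uniform, and `eps`, `n_iters`, and the function name `sinkhorn_pseudo_labels` are hypothetical choices.

```python
import math


def sinkhorn_pseudo_labels(features, centroids, class_marginal=None,
                           eps=0.5, n_iters=200):
    """Soft and hard pseudo-labels from an entropic OT plan.

    Assumptions (not from the paper): squared Euclidean cost to class
    centroids, equal mass per sample, uniform class marginal unless
    one is supplied. No log-domain stabilisation is used, so costs
    much larger than eps can underflow.
    """
    n, k = len(features), len(centroids)
    cost = [[math.dist(f, c) ** 2 for c in centroids] for f in features]
    kernel = [[math.exp(-cij / eps) for cij in row] for row in cost]

    row_mass = [1.0 / n] * n
    if class_marginal is None:
        class_marginal = [1.0 / k] * k

    u = [1.0] * n
    v = [1.0] * k
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [row_mass[i] / sum(kernel[i][j] * v[j] for j in range(k))
             for i in range(n)]
        v = [class_marginal[j] / sum(kernel[i][j] * u[i] for i in range(n))
             for j in range(k)]

    plan = [[u[i] * kernel[i][j] * v[j] for j in range(k)]
            for i in range(n)]
    # Normalise each row of the plan into a distribution over classes.
    soft = [[p / sum(row) for p in row] for row in plan]
    hard = [max(range(k), key=row.__getitem__) for row in soft]
    return soft, hard
```

Because the cost matrix has one column per class rather than one per labeled sample, a tail class with few examples still contributes a single, comparatively stable target, which is the motivation the summary gives for using centroids. The class marginal controls how much mass each class receives, so in practice it would need to reflect whatever prior over classes the training scheme intends.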
