Learning from Noisy Labels for Long-tailed Data via Optimal Transport

Mengting Li, Chuang Zhu

TL;DR

This work addresses learning with noisy labels (LNL) on long-tailed data with OTLNL, a two-stage framework. The first stage applies a dynamic loss-distance cross-selection that filters clean samples by combining the model's label predictions with feature-centroid information. The second stage uses optimal transport to generate pseudo-labels for the noisy set, replacing individual features with class centroids in the transport cost to alleviate tail-class sparsity. The method combines a dynamic, class-specific thresholding strategy, centroid-based filtering, and OT-driven semi-supervised denoising with consistency and contrastive losses, which improves pseudo-label quality and balances learning across head and tail classes. Experiments on synthetic long-tailed noisy CIFAR-10/100 and on WebVision report state-of-the-art performance, and ablations support the contribution of each component. Overall, the paper integrates optimal transport into noisy-label learning under long-tailed distributions and provides empirical evidence of improved resilience to both label noise and class imbalance in computer vision tasks.
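The cross-selection idea in the first stage can be illustrated with a small NumPy sketch. It is not the authors' implementation: the function name `cross_select`, the use of per-class means as the class-specific thresholds, and the rule "keep a sample only if both its loss and its centroid distance fall below the class threshold" are all assumptions made for illustration. The summary only states that loss and feature-centroid information are combined with class-specific thresholds.

```python
import numpy as np

def cross_select(losses, features, labels, num_classes):
    """Return a boolean mask of samples treated as clean.

    Illustrative rule (an assumption, not the paper's exact criterion):
    within each class, a sample is kept when its loss AND its distance
    to the class centroid are both at or below the per-class mean.
    """
    clean = np.zeros(len(labels), dtype=bool)
    for c in range(num_classes):
        idx = np.flatnonzero(labels == c)
        if idx.size == 0:
            continue  # class absent from this batch
        centroid = features[idx].mean(axis=0)
        dist = np.linalg.norm(features[idx] - centroid, axis=1)
        loss_ok = losses[idx] <= losses[idx].mean()  # class-specific loss threshold
        dist_ok = dist <= dist.mean()                # class-specific distance threshold
        clean[idx] = loss_ok & dist_ok
    return clean

# Toy data: class 0 is a head class with one suspicious sample, class 1 is a tail class.
features = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0],
                     [10.0, 10.0], [12.0, 10.0]])
labels = np.array([0, 0, 0, 0, 1, 1])
losses = np.array([0.1, 0.1, 0.1, 2.0, 0.2, 0.2])
mask = cross_select(losses, features, labels, num_classes=2)
```

Because the thresholds are computed per class, the two tail samples are judged against their own class statistics instead of competing with the more numerous head samples.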

Abstract

Noisy labels, which are common in real-world datasets, can significantly impair the training of deep learning models. However, recent adversarial noise-combating methods overlook the long-tailed distribution of real data, which can significantly harm the effect of denoising strategies. Meanwhile, the mismanagement of noisy labels further compromises the model's ability to handle long-tailed data. To tackle this issue, we propose a novel approach to manage data characterized by both long-tailed distributions and noisy labels. First, we introduce a loss-distance cross-selection module, which integrates class predictions and feature distributions to filter clean samples, effectively addressing uncertainties introduced by noisy labels and long-tailed distributions. Subsequently, we employ optimal transport strategies to generate pseudo-labels for the noise set in a semi-supervised training manner, enhancing pseudo-label quality while mitigating the effects of sample scarcity caused by the long-tailed distribution. We conduct experiments on both synthetic and real-world datasets, and the comprehensive experimental results demonstrate that our method surpasses current state-of-the-art methods. Our code will be available in the future.
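As a rough illustration of how optimal transport can assign pseudo-labels to the noisy set, the sketch below builds a transport cost from the distance between each noisy sample and each class centroid, then solves the entropy-regularized problem with Sinkhorn iterations. Several pieces are assumptions not taken from the abstract: the squared Euclidean cost, the uniform class marginal, the regularization strength `eps`, the choice of Sinkhorn as the solver, and the function name `sinkhorn_pseudo_labels`.

```python
import numpy as np

def sinkhorn_pseudo_labels(features, centroids, class_marginal=None,
                           eps=0.1, n_iters=200):
    """Soft and hard pseudo-labels from an entropic OT plan (illustrative)."""
    n, k = features.shape[0], centroids.shape[0]
    # Cost: squared distance from every sample to every class centroid (assumed form).
    cost = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    cost = cost / max(cost.max(), 1e-12)   # rescale to [0, 1] for numerical stability
    kernel = np.exp(-cost / eps)
    r = np.full(n, 1.0 / n)                # each sample carries equal mass
    if class_marginal is None:
        c = np.full(k, 1.0 / k)            # assumed uniform class marginal
    else:
        c = np.asarray(class_marginal, dtype=float)
    u, v = np.ones(n), np.ones(k)
    for _ in range(n_iters):               # alternate row and column scaling
        u = r / (kernel @ v)
        v = c / (kernel.T @ u)
    plan = u[:, None] * kernel * v[None, :]
    soft = plan / plan.sum(axis=1, keepdims=True)  # row-normalize into label distributions
    return soft, soft.argmax(axis=1)

# Toy example: two class centroids and four unlabeled (noisy-set) samples.
centroids = np.array([[0.0, 0.0], [4.0, 4.0]])
noisy_features = np.array([[0.1, 0.0], [0.0, 0.2], [3.9, 4.0], [4.0, 4.1]])
soft_labels, hard_labels = sinkhorn_pseudo_labels(noisy_features, centroids)
```

The class marginal is the lever for long-tailed data: passing a non-uniform `class_marginal` changes how much total mass each class is allowed to receive, which a plain nearest-centroid assignment cannot express.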

Paper Structure (13 sections, 12 equations, 2 figures, 3 tables)

Figures (2)

  • Figure 1: The framework of OTLNL. Initially, in the sample selection phase, our loss-distance cross-selection module integrates the model's predictions of sample class probabilities and sample feature distributions to filter clean samples, thereby addressing the uncertainties introduced by noisy labels and long-tailed distributions. Subsequently, during the optimization denoising phase, we employ optimization strategies to generate pseudo-labels for the noise set, enhancing pseudo-label quality while mitigating the effects of sample scarcity.
  • Figure 2: F1-score of sample selection for the head, medium, and tail classes on the CIFAR-10 and CIFAR-100 datasets under $\gamma = 0.5$ and $\rho = 100$.