Learning with Imbalanced Noisy Data by Preventing Bias in Sample Selection

Huafeng Liu, Mengmeng Sheng, Zeren Sun, Yazhou Yao, Xian-Sheng Hua, Heng-Tao Shen

TL;DR

This work tackles learning with noisy labels in imbalanced data. It introduces a unified framework comprising Class-Balance-based Sample Selection (CBS), Confidence-based Sample Augmentation (CSA), Exponential Moving Average (EMA) based label correction, Average Confidence Margin (ACM), and consistency regularization to exploit both clean and corrected noisy samples. The method normalizes per-class losses, augments clean samples with confidence-weighted Mixup-like operations, corrects noisy labels through prediction history, and gates corrections using ACM, while enforcing prediction consistency. Experiments on synthetic CIFAR datasets and real-world web-noise datasets show significant improvements, especially under severe imbalance, demonstrating practical robustness.
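The per-class loss normalization mentioned above can be illustrated with a small sketch. This summary does not give the exact normalization used by CBS, so dividing each loss by its class mean and keeping samples below a threshold of 1.0 is an assumption made here for illustration; the function names and toy numbers are hypothetical.

```python
from collections import defaultdict


def select_clean_by_class_normalized_loss(losses, labels, threshold=1.0):
    """Keep indices whose loss divided by their class mean loss is below threshold.

    Assumption: class-mean normalization stands in for the paper's CBS rule.
    """
    class_losses = defaultdict(list)
    for loss, label in zip(losses, labels):
        class_losses[label].append(loss)
    class_mean = {c: sum(v) / len(v) for c, v in class_losses.items()}

    clean = []
    for idx, (loss, label) in enumerate(zip(losses, labels)):
        mean = class_mean[label]
        normalized = loss / mean if mean > 0 else 0.0
        if normalized < threshold:
            clean.append(idx)
    return clean


def select_clean_by_global_loss(losses, threshold=1.0):
    """Baseline: normalize by the global mean loss, ignoring class membership."""
    mean = sum(losses) / len(losses)
    return [i for i, loss in enumerate(losses) if loss / mean < threshold]


# Toy data: class 0 is a head class (6 samples), class 1 is a tail class (2 samples).
# The tail class is under-learned, so even its clean sample has a fairly high loss.
toy_losses = [0.1, 0.2, 0.3, 0.4, 1.5, 1.6, 1.0, 2.0]
toy_labels = [0, 0, 0, 0, 0, 0, 1, 1]

global_clean = select_clean_by_global_loss(toy_losses)
balanced_clean = select_clean_by_class_normalized_loss(toy_losses, toy_labels)
print(global_clean)    # [0, 1, 2, 3]: no tail-class sample survives
print(balanced_clean)  # [0, 1, 2, 3, 6]: the lower-loss tail sample is kept
```

In this toy case the global criterion discards every tail-class sample, while the class-normalized criterion keeps the tail sample whose loss is low relative to its own class.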

Abstract

Learning with noisy labels has gained increasing attention because the inevitable imperfect labels in real-world scenarios can substantially hurt deep model performance. Recent studies tend to regard low-loss samples as clean ones and discard high-loss ones to alleviate the negative impact of noisy labels. However, real-world datasets contain not only noisy labels but also class imbalance. The imbalance issue is prone to causing failure in loss-based sample selection, since the under-learning of tail classes also tends to produce high losses. To this end, we propose a simple yet effective method to address noisy labels in imbalanced datasets. Specifically, we propose Class-Balance-based sample Selection (CBS) to prevent the tail class samples from being neglected during training. We propose Confidence-based Sample Augmentation (CSA) for the chosen clean samples to enhance their reliability in the training process. To exploit selected noisy samples, we resort to prediction history to rectify labels of noisy samples. Moreover, we introduce the Average Confidence Margin (ACM) metric to measure the quality of corrected labels by leveraging the model's evolving training dynamics, thereby ensuring that low-quality corrected noisy samples are appropriately masked out. Lastly, consistency regularization is imposed on filtered label-corrected noisy samples to boost model performance. Comprehensive experimental results on synthetic and real-world datasets demonstrate the effectiveness and superiority of our proposed method, especially in imbalanced scenarios.
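The idea of correcting labels from prediction history and masking low-quality corrections can be sketched as follows. The abstract does not define the EMA update or the ACM formula, so everything here is an assumption for illustration: the history is an exponential moving average of per-epoch softmax outputs, the corrected label is its argmax, and the margin is the corrected class probability minus the largest other probability, averaged over epochs. The names `momentum` and `margin_threshold` are hypothetical parameters.

```python
def ema_update(history, probs, momentum):
    """Blend the running prediction history with the current epoch's probabilities."""
    return [momentum * h + (1.0 - momentum) * p for h, p in zip(history, probs)]


def confidence_margin(probs, label):
    """Probability of `label` minus the largest probability among the other classes."""
    best_other = max(p for c, p in enumerate(probs) if c != label)
    return probs[label] - best_other


def correct_and_gate(probs_per_epoch, momentum=0.5, margin_threshold=0.3):
    """Return (corrected_label, average_margin, keep) for one noisy sample.

    Assumed procedure, not the paper's exact definition:
    1. EMA over the per-epoch probability vectors.
    2. Corrected label = argmax of the EMA vector.
    3. Average the per-epoch margin of that label; keep only if above threshold.
    """
    history = list(probs_per_epoch[0])
    for probs in probs_per_epoch[1:]:
        history = ema_update(history, probs, momentum)
    label = max(range(len(history)), key=lambda c: history[c])
    margins = [confidence_margin(p, label) for p in probs_per_epoch]
    average_margin = sum(margins) / len(margins)
    return label, average_margin, average_margin > margin_threshold


# A sample the model predicts consistently as class 1 across three epochs.
stable = [[0.2, 0.7, 0.1], [0.1, 0.8, 0.1], [0.2, 0.6, 0.2]]
# A sample whose prediction flips between class 0 and class 1.
unstable = [[0.6, 0.4, 0.0], [0.4, 0.6, 0.0], [0.55, 0.45, 0.0]]

stable_result = correct_and_gate(stable)
unstable_result = correct_and_gate(unstable)
```

Under these assumptions the stable sample receives corrected label 1 and passes the gate, while the oscillating sample has an average margin near zero and is masked out.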

Paper Structure

This paper contains 19 sections, 17 equations, 6 figures, 5 tables, and 1 algorithm.

Figures (6)

  • Figure 1: The sample distribution (left) and the mean loss variation (right) on noisy and imbalanced CIFAR100 (noise rate 0.4, imbalance factor 20). We observe that (1) both tail class samples and noisy samples exhibit large losses, and (2) losses of some clean samples belonging to tail classes are even larger than losses of some noisy ones from head classes. Accordingly, existing low-loss-based sample selection methods tend to fail when distinguishing clean and noisy samples. This inspires us to develop a class-balanced sample selection method to combat noisy and imbalanced labels.
  • Figure 2: The overall framework of our proposed approach. We first divide the noisy training set into clean and noisy subsets in a class-balanced manner using the proposed class-balance-based sample selection (CBS) method. Then, for samples in the clean subset, we propose a confidence-based sample augmentation (CSA) method to enhance the reliability of the selected clean samples. Subsequently, the exponential moving average (EMA) is adopted to correct the labels of noisy samples, so that noisy samples are also used for model training. In addition, the average confidence margin (ACM) is proposed to measure the quality of corrected labels as training progresses. Finally, we employ consistency regularization to further boost model performance. This regularization term not only enhances the extracted features but also stabilizes training by encouraging epoch-wise prediction consistency.
  • Figure 3: The number of samples belonging to each class in CIFAR100 under various imbalance factor settings (left) and an example of the uniform noise transition matrix (right).
  • Figure 4: The test accuracy (%) vs. epochs on CIFAR100 with IF-1-NR-0% (a), IF-1-NR-20% (b), IF-10-NR-20% (c) and IF-50-NR-60% (d) during the training process. (IF-X-NR-Y% means that the imbalance factor and the noise rate are X and Y%, respectively.)
  • Figure 5: Some visualization results of clean and noisy samples selected by our sample selection methods on Web-Aircraft, Web-Bird, and Web-Car. The corresponding fine-grained class names are DHC-1, frigatebird, and Ferrari 458 Italia Coupe 2012.
  • ...and 1 more figure
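The two quantities in the Figure 3 caption, per-class sample counts under an imbalance factor and a uniform noise transition matrix, can be sketched as below. The captions do not state the construction, so this assumes the common exponential long-tailed profile, where class i keeps n_max * (1/IF)^(i/(C-1)) samples, and a uniform matrix that spreads the noise rate evenly over the C-1 other classes. Both are assumptions, and the function names are hypothetical.

```python
def long_tailed_class_counts(num_classes, max_per_class, imbalance_factor):
    """Per-class sample counts decaying exponentially from head to tail.

    Assumed profile: the last class keeps max_per_class / imbalance_factor samples.
    """
    counts = []
    for i in range(num_classes):
        frac = i / (num_classes - 1) if num_classes > 1 else 0.0
        counts.append(round(max_per_class * (1.0 / imbalance_factor) ** frac))
    return counts


def uniform_noise_transition(num_classes, noise_rate):
    """Row-stochastic matrix T where T[i][j] is the probability of flipping label i to j.

    Assumed form: the true label is kept with probability 1 - noise_rate and the
    remaining mass is split evenly over the other classes.
    """
    off_diagonal = noise_rate / (num_classes - 1)
    return [
        [1.0 - noise_rate if i == j else off_diagonal for j in range(num_classes)]
        for i in range(num_classes)
    ]


# CIFAR100 has 500 training images per class; the Figure 1 setting uses
# an imbalance factor of 20 and a noise rate of 0.4.
counts = long_tailed_class_counts(num_classes=100, max_per_class=500, imbalance_factor=20)
transition = uniform_noise_transition(num_classes=100, noise_rate=0.4)
print(counts[0], counts[-1])  # 500 25
```

With these assumptions the head class keeps all 500 images and the rarest class keeps 25, and every row of the transition matrix sums to 1.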