Conformal-in-the-Loop for Learning with Imbalanced Noisy Data

John Brandon Graham-Knight, Jamil Fayyad, Nourhan Bayasi, Patricia Lasserre, Homayoun Najjaran

TL;DR

Conformal Prediction produces prediction sets that contain the true label with probability at least $1-\alpha$, enabling uncertainty-aware training. CitL calibrates the conformal predictor on a validation set, then uses the prediction-set size $|\mathbf{P}_x|$ to weight each training example's loss and to prune highly uncertain examples, addressing both class imbalance and noisy labels in a single training run. On multiclass CIFAR-10 with synthetic noise and imbalance and on CityScapes segmentation, CitL achieves gains of up to $6.1\%$ in accuracy and $5.0$ in mIoU with modest overhead, demonstrating practical robustness for real-world datasets. The approach offers a scalable, model-agnostic framework that emphasizes informative, harder-to-learn examples while suppressing mislabeled data, improving generalization in imbalanced, noisy settings.

Abstract

Class imbalance and label noise are pervasive in large-scale datasets, yet much of machine learning research assumes well-labeled, balanced data, which rarely reflects real-world conditions. Existing approaches typically address either label noise or class imbalance in isolation, leading to suboptimal results when both issues coexist. In this work, we propose Conformal-in-the-Loop (CitL), a novel training framework that addresses both challenges with a conformal prediction-based approach. CitL evaluates sample uncertainty to adjust weights and prune unreliable examples, enhancing model resilience and accuracy with minimal computational cost. Our extensive experiments include a detailed analysis showing how CitL effectively emphasizes impactful data in noisy, imbalanced datasets. Our results show that CitL consistently boosts model performance, achieving up to a 6.1% increase in classification accuracy and a 5.0 mIoU improvement in segmentation. Our code is publicly available: CitL.
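
The mechanism summarized above (calibrate on validation data, measure each training example's prediction-set size, then weight or prune) can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes standard split conformal prediction with the nonconformity score $1 - p_{\text{true}}$, and the function names, the "weight equals set size" rule, and the `prune_above` cutoff are hypothetical choices made only for illustration.

```python
import numpy as np

def calibrate_threshold(val_probs, val_labels, alpha):
    """Split-conformal threshold from validation softmax outputs (assumed setup)."""
    n = len(val_labels)
    # Nonconformity score: 1 - probability assigned to the true class.
    scores = 1.0 - val_probs[np.arange(n), val_labels]
    # Rank of the finite-sample-corrected (1 - alpha) quantile.
    k = int(np.ceil((n + 1) * (1.0 - alpha)))
    if k > n:
        return 1.0  # too few calibration points: every class is included
    return np.sort(scores)[k - 1]

def prediction_set_sizes(probs, q_hat):
    """Count the classes whose score 1 - p does not exceed the threshold."""
    return ((1.0 - probs) <= q_hat).sum(axis=1)

def example_weights(set_sizes, prune_above):
    """Hypothetical rule: weight = set size; sets larger than prune_above get 0."""
    weights = set_sizes.astype(float)
    weights[set_sizes > prune_above] = 0.0
    return weights

# Toy validation split: 4 examples, 3 classes.
val_probs = np.array([
    [0.875, 0.0625, 0.0625],
    [0.125, 0.75, 0.125],
    [0.25, 0.25, 0.5],
    [0.25, 0.5, 0.25],
])
val_labels = np.array([0, 1, 2, 0])
q_hat = calibrate_threshold(val_probs, val_labels, alpha=0.25)

# Toy training batch: confident, ambiguous, and highly uncertain examples.
train_probs = np.array([
    [0.875, 0.0625, 0.0625],
    [0.5, 0.375, 0.125],
    [0.375, 0.3125, 0.3125],
])
sizes = prediction_set_sizes(train_probs, q_hat)
weights = example_weights(sizes, prune_above=2)
print(q_hat, sizes, weights)
```

In a training loop, `weights` would multiply the per-example loss before averaging. Under this illustrative rule, the ambiguous example is up-weighted while the example whose set covers every class is dropped as likely noise.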

Paper Structure

This paper contains 22 sections, 5 equations, 12 figures, 4 tables, 2 algorithms.

Figures (12)

  • Figure 1: An overview of the proposed Conformal-in-the-Loop (CitL) framework. Conformal Prediction is used to generate prediction sets during self-supervised machine learning. Training examples are weighted by uncertainty; validation examples are used to calibrate the uncertainty model.
  • Figure 2: Segmentation mIoU on the validation set for various values of $\alpha$. The baseline performance is indicated by the dotted blue line.
  • Figure 3: Segmentation mIoU for $\alpha \in [0.01, 0.10]$. The best mIoU of 71.1 is achieved at $\alpha = 0.02$. Baseline mIoU (66.1) is represented by the dotted blue line.
  • Figure 4: Classification accuracy on imbalanced CIFAR-10 using our CitL method and the baseline with cross-entropy loss (blue dotted line) and focal loss (orange dotted line). The box plots show 10 $\alpha$ values ranging from 0.10 to 0.19.
  • Figure 5: Average per-training-step time of CitL over the baseline. The multiple CitL values correspond to different settings of the $\alpha$ hyperparameter. (CIFAR-10: 11.2 $\pm$ 0.8%; CityScapes: 4.0 $\pm$ 0.4%)
  • ...and 7 more figures