Calibration improves detection of mislabeled examples

Ilies Chibane, Thomas George, Pierre Nodet, Vincent Lemaire

TL;DR

This paper tackles mislabeled data by examining model-probing detectors whose trust scores rely on a base model's confidences. It proposes a simple calibration step prior to probing, using isotonic regression or Platt scaling, to produce more reliable scores across models and ensembles. Across a large benchmark of 19 multiclass, weak-label datasets, calibrated detectors consistently improve the end-to-end detection–filtering–training pipeline, including when the calibration data itself contains label noise or is small. The work argues that calibration is a practical, low-cost enhancement that yields robust gains in mislabel detection and downstream classifier performance, especially in class-imbalanced or underrepresented subgroups.

Abstract

Mislabeled data is a pervasive issue that undermines the performance of machine learning systems in real-world applications. An effective approach to mitigate this problem is to detect mislabeled instances and subject them to special treatment, such as filtering or relabeling. Automatic mislabeling detection methods typically rely on training a base machine learning model and then probing it for each instance to obtain a trust score indicating whether the provided label is genuine or incorrect. The properties of this base model are thus of paramount importance. In this paper, we investigate the impact of calibrating this model. Our empirical results show that using calibration methods improves the accuracy and robustness of mislabeled instance detection, providing a practical and effective solution for industrial applications.
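The calibrate-then-probe idea described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the synthetic data, the logistic-regression base model, the one-vs-rest isotonic calibration with row renormalization, and the helper names `fit_isotonic_calibrators`, `calibrate`, and `trust_scores` are all hypothetical choices made for the example. The trust score used here is simply the calibrated probability assigned to the provided label, which is one common probing rule among several.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def fit_isotonic_calibrators(proba_cal, y_cal, n_classes):
    """Fit one one-vs-rest isotonic map per class on held-out predictions."""
    calibrators = []
    for k in range(n_classes):
        iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
        iso.fit(proba_cal[:, k], (y_cal == k).astype(float))
        calibrators.append(iso)
    return calibrators


def calibrate(proba, calibrators):
    """Apply the per-class maps, then renormalize each row to sum to 1."""
    cal = np.column_stack(
        [c.predict(proba[:, k]) for k, c in enumerate(calibrators)]
    )
    row_sums = cal.sum(axis=1, keepdims=True)
    safe_sums = np.where(row_sums > 0, row_sums, 1.0)
    uniform = np.full_like(cal, 1.0 / cal.shape[1])
    # Rows mapped entirely to zero fall back to a uniform distribution.
    return np.where(row_sums > 0, cal / safe_sums, uniform)


def trust_scores(proba, y_given):
    """Probe: probability the model assigns to each provided label."""
    return proba[np.arange(len(y_given)), y_given]


# Synthetic 3-class problem (assumption: stands in for a real dataset).
n_classes = 3
X, y = make_classification(
    n_samples=1500, n_features=10, n_informative=5,
    n_classes=n_classes, random_state=0,
)
X_train, X_cal, y_train, y_cal = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Inject 20% label noise into the training labels only.
rng = np.random.default_rng(0)
flip = rng.random(len(y_train)) < 0.2
shift = rng.integers(1, n_classes, size=len(y_train))
y_noisy = np.where(flip, (y_train + shift) % n_classes, y_train)

# 1) Train the base model on the noisy labels.
base = LogisticRegression(max_iter=1000).fit(X_train, y_noisy)

# 2) Calibrate it on the held-out calibration set.
calibrators = fit_isotonic_calibrators(base.predict_proba(X_cal), y_cal, n_classes)

# 3) Probe the calibrated model to get one trust score per training example.
proba_train_cal = calibrate(base.predict_proba(X_train), calibrators)
trust = trust_scores(proba_train_cal, y_noisy)
```

Examples with the lowest `trust` values would then be the first candidates for filtering or relabeling. Swapping the isotonic maps for a sigmoid fit would give a Platt-scaling variant of the same sketch.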

Paper Structure

This paper contains 28 sections, 7 equations, 7 figures, 6 tables.

Figures (7)

  • Figure 1: Schematic summary of calibration of model-probing detection methods with ensembling strategies. All models generated (or trained) by an ensembling strategy are independently calibrated on the same held-out calibration (cal.) set before probing.
  • Figure 2: Evolution of the number of removed examples when filtering instances from the least to the most trusted ones on the x-axis, with the number of removed examples from the minority classes (classes with priors less than $1/C$) on the y-axis. We plot the median ratio over all datasets alongside the 25th and 75th percentiles for detectors without calibration (none) and detectors with Platt scaling (sigmoid) and isotonic regression. Detectors without calibration tend to consider minority examples as untrusted, whereas calibrated detectors do not.
  • Figure 3: Distribution (boxplot) of the normalized test loss (lower is better) of the final classifier after the 3-stage pipeline, for varying detectors over all datasets (the circles $\bullet$). The loss is normalized so that 100 corresponds to silver (training on correctly labeled examples only) and 200 to none (training on all examples, including mislabeled ones). Classifiers are calibrated using a clean calibration set. Calibrated detectors are significantly better than their adjusted or baseline counterparts.
  • Figure 4: For each detector/dataset pair (a circle), we compare the test loss of a baseline detector on the x-axis with the test loss of the same detector calibrated on a clean calibration set or on a noisy calibration set on the y-axis. A clean calibration set is often the most effective (its points lie mainly below the $y=x$ line). A noisy calibration set, although less effective, is not significantly worse than the baseline (its points are evenly distributed around the $y=x$ line).
  • Figure 5: Distribution (boxplot) of the normalized test loss (lower is better) of the final classifier after the 3-stage pipeline, for varying calibrated detectors over all datasets (the circles $\bullet$). The loss is normalized so that 100 corresponds to silver (training on correctly labeled examples only) and 200 to none (training on all examples, including mislabeled ones). Classifiers are calibrated using a clean calibration set of varying size, from 10 samples (white boxplot) to 1000 samples (colorful boxplot). Adding calibration samples beyond 100 yields diminishing returns.
  • ...and 2 more figures
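
The curve described in the Figure 2 caption can be sketched as follows. This is an illustrative reading of the caption, not the authors' plotting code: the function name `minority_removal_curve` is hypothetical, class priors are estimated from the provided labels (an assumption), and the function returns raw cumulative counts rather than the per-dataset ratio that the caption aggregates.

```python
import numpy as np


def minority_removal_curve(trust, y_given):
    """Cumulative count of minority-class examples removed when filtering
    from the least trusted example to the most trusted one."""
    trust = np.asarray(trust, dtype=float)
    y_given = np.asarray(y_given)

    # Minority classes: empirical prior strictly below 1/C.
    classes, counts = np.unique(y_given, return_counts=True)
    priors = counts / counts.sum()
    minority = classes[priors < 1.0 / len(classes)]

    # Remove examples in order of increasing trust.
    order = np.argsort(trust, kind="stable")
    is_minority = np.isin(y_given[order], minority)
    return np.cumsum(is_minority)


# Toy example: class 0 has prior 0.5, classes 1 and 2 have prior 0.25 each,
# so with C = 3 the minority classes are 1 and 2.
y_toy = np.array([0, 0, 0, 0, 1, 1, 2, 2])
trust_toy = np.array([0.9, 0.8, 0.7, 0.6, 0.1, 0.2, 0.3, 0.4])
curve = minority_removal_curve(trust_toy, y_toy)
```

In this toy case the detector distrusts every minority example before any majority example, which is the behavior the caption attributes to uncalibrated detectors. A curve closer to the diagonal would indicate that minority examples are not being singled out.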