Calibration improves detection of mislabeled examples
Ilies Chibane, Thomas George, Pierre Nodet, Vincent Lemaire
TL;DR
This paper tackles mislabeled data by examining model-probing detectors whose trust scores rely on a base model's confidences. It proposes a simple calibration step prior to probing, using isotonic regression or Platt scaling, to produce more reliable scores across models and ensembles. Across a large benchmark of 19 multiclass, weak-label datasets, calibrated detectors consistently improve the end-to-end detection–filtering–training pipeline, including when the calibration data itself contains label noise or is small. The work argues that calibration is a practical, low-cost enhancement that yields robust gains in mislabel detection and downstream classifier performance, especially in class-imbalanced or underrepresented subgroups.
Abstract
Mislabeled data is a pervasive issue that undermines the performance of machine learning systems in real-world applications. An effective approach to mitigate this problem is to detect mislabeled instances and subject them to special treatment, such as filtering or relabeling. Automatic mislabeling detection methods typically rely on training a base machine learning model and then probing it for each instance to obtain a trust score indicating how likely the provided label is to be genuine rather than incorrect. The properties of this base model are thus of paramount importance. In this paper, we investigate the impact of calibrating this model. Our empirical results show that using calibration methods improves the accuracy and robustness of mislabeled instance detection, providing a practical and effective solution for industrial applications.
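The calibrate-then-probe idea can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a binary task (the paper's benchmark is multiclass), a hypothetical base model whose raw confidences for the positive class are already available as plain lists, and a simple step-function isotonic fit via pool-adjacent-violators. All function names and toy numbers below are illustrative assumptions.

```python
from bisect import bisect_left


def fit_isotonic(scores, labels):
    """Pool-adjacent-violators: fit a non-decreasing step function
    mapping raw scores to observed positive-label frequency."""
    blocks = []  # each block: [sum_of_labels, count, max_score_in_block]
    for score, label in sorted(zip(scores, labels)):
        blocks.append([float(label), 1, score])
        # Merge backwards while the previous block's mean exceeds the last one.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count, upper = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = upper
    thresholds = [b[2] for b in blocks]
    values = [b[0] / b[1] for b in blocks]
    return thresholds, values


def calibrate(score, thresholds, values):
    """Look up the calibrated probability, clamped to the fitted range."""
    idx = bisect_left(thresholds, score)
    return values[min(idx, len(values) - 1)]


def trust_score(prob_positive, given_label):
    """Calibrated probability that the provided label is the true one."""
    return prob_positive if given_label == 1 else 1.0 - prob_positive


# Toy held-out calibration set (assumed data): raw confidences and labels.
cal_scores = [0.05, 0.10, 0.20, 0.35, 0.60, 0.70, 0.80, 0.90, 0.95, 0.99]
cal_labels = [0, 0, 1, 0, 0, 1, 1, 0, 1, 1]
thresholds, values = fit_isotonic(cal_scores, cal_labels)

# Toy training instances to probe (assumed data): raw confidence, given label.
train_scores = [0.97, 0.03, 0.85, 0.15]
train_labels = [1, 0, 0, 1]
trust = [
    trust_score(calibrate(s, thresholds, values), y)
    for s, y in zip(train_scores, train_labels)
]

# Lowest trust first: these are the candidates for filtering or relabeling.
ranked = sorted(range(len(trust)), key=lambda i: trust[i])
print(ranked, trust)
```

In this sketch the two instances whose given label disagrees with the calibrated confidence receive the lowest trust. A real pipeline would more likely use an established calibration implementation (isotonic regression or Platt scaling) on top of the trained base model rather than this hand-rolled fit.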
