Learning Confident Classifiers in the Presence of Label Noise
Asma Ahmed Hashmi, Aigerim Zhumabayeva, Nikita Kotelevskii, Artem Agafonov, Mohammad Yaqub, Maxim Panov, Martin Takáč
TL;DR
This work addresses learning under label noise when multiple annotators provide labels for each example, a common scenario in medical imaging. It introduces a probabilistic framework with a base predictor p_theta(y|x) and an annotator (confusion) network T^psi_r(x) that models per-annotator label noise, so that the noisy label distribution of annotator r is p^r_{theta,psi}(x) = T^psi_r(x) p_theta(x) and the ground-truth distribution can be recovered from the base predictor. A novel confidence-regularization objective pushes the base predictor toward confident, near one-hot predictions, which improves the disentanglement of noise from true labels. For segmentation, an additional loss leverages regions where all annotators agree (confident regions). Empirical results on synthetic and real-world datasets (MNIST, CIFAR-10, Fashion-MNIST, CIFAR-10N; LIDC; RIGA) show performance that matches or exceeds several baselines, especially under heavy noise, along with improved Dice scores and stable, robust behavior for segmentation. The study has practical value for developing reliable classifiers and segmenters in settings with annotator disagreement, and it releases public code to facilitate further research and benchmarking.
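The relation p^r_{theta,psi}(x) = T^psi_r(x) p_theta(x) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, array shapes, the column-stochastic convention for T, the use of Shannon entropy as the confidence regularizer, and the weight `lam` are all assumptions made for illustration, since the summary above does not specify the exact form of the regularizer.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def annotator_predictions(p_base, T):
    """Apply per-annotator confusion matrices to the base prediction.

    p_base: (B, C) base class probabilities p_theta(y|x).
    T:      (R, B, C, C) confusion matrices, assumed column-stochastic:
            T[r, b, i, j] = P(annotator r reports i | true class j, x_b).
    Returns (R, B, C) noisy-label distributions, one per annotator.
    """
    return np.einsum("rbij,bj->rbi", T, p_base)

def noisy_label_nll(p_annot, labels):
    """Mean negative log-likelihood of observed labels of shape (R, B)."""
    picked = np.take_along_axis(p_annot, labels[..., None], axis=-1)[..., 0]
    return -np.mean(np.log(picked + 1e-12))

def entropy(p):
    """Shannon entropy of each distribution along the last axis."""
    return -np.sum(p * np.log(p + 1e-12), axis=-1)

def total_loss(p_base, T, labels, lam=0.1):
    """Data-fit term plus an (assumed) entropy penalty on the base predictor.

    Minimizing the entropy term favors near one-hot base predictions.
    """
    p_annot = annotator_predictions(p_base, T)
    return noisy_label_nll(p_annot, labels) + lam * entropy(p_base).mean()
```

With identity confusion matrices every annotator's distribution equals the base prediction, and the entropy term vanishes only when the base prediction is one-hot, which is the behavior the confidence regularization is described as encouraging.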
Abstract
The success of Deep Neural Network (DNN) models depends significantly on the quality of the provided annotations. In medical image segmentation, for example, it is common to have multiple expert annotations for each data point in order to minimize subjective annotation bias. The goal of estimation is then to filter out the label noise and recover the ground-truth masks, which are not explicitly given. This paper proposes a probabilistic model for noisy observations that allows us to build confident classification and segmentation models. To accomplish this, we explicitly model label noise and introduce a new information-based regularization that pushes the network to recover the ground-truth labels. In addition, for the segmentation task we adjust the loss function by prioritizing learning in high-confidence regions where all annotators agree on the labeling. We evaluate the proposed method on a series of classification tasks, including noisy versions of the MNIST, CIFAR-10, and Fashion-MNIST datasets as well as CIFAR-10N, a real-world dataset with noisy human annotations. For the segmentation task, we consider several medical imaging datasets, such as LIDC and RIGA, that reflect real-world variability among multiple annotators. Our experiments show that our algorithm outperforms state-of-the-art solutions for the considered classification and segmentation problems.
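The idea of prioritizing regions where all annotators agree can be sketched as follows. This is a hypothetical illustration, not the paper's loss: the helper names, the array shapes, and the choice to average the pixel-wise negative log-likelihood over agreed pixels only are assumptions, and the abstract does not state how this term is combined with the main objective.

```python
import numpy as np

def agreement_mask(masks):
    """Return a boolean (H, W) map that is True where all annotators agree.

    masks: (R, H, W) integer label maps, one per annotator.
    """
    return np.all(masks == masks[0], axis=0)

def confident_region_nll(p_base, masks):
    """Mean negative log-likelihood restricted to agreed pixels.

    p_base: (H, W, C) per-pixel class probabilities from the base predictor.
    masks:  (R, H, W) integer label maps. Where all annotators agree, the
            shared label is treated as the target (an assumption).
    """
    conf = agreement_mask(masks)
    agreed = masks[0]  # equals every annotator's label wherever conf is True
    picked = np.take_along_axis(p_base, agreed[..., None], axis=-1)[..., 0]
    nll = -np.log(picked + 1e-12)
    # Guard against an image with no agreed pixels.
    return float((nll * conf).sum() / max(conf.sum(), 1))
```

Pixels on which the annotators disagree contribute nothing to this term, so they would have to be handled by the noise-aware part of the objective.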
