Automated Detection of Label Errors in Semantic Segmentation Datasets via Deep Learning and Uncertainty Quantification
Matthias Rottmann, Marco Reese
TL;DR
This work tackles the problem of label errors in semantic segmentation datasets by elevating uncertainty quantification to the level of connected components. The authors introduce a principled method that, given a trained segmentation model, computes component-wise probabilities of correctness using an adjusted IoU and a calibrated meta classifier to produce $P(\mathrm{TP}_o)$ for each predicted component; false-positive components with high $P(\mathrm{TP}_o)$ are flagged as potential label errors. They establish a benchmark by synthetically dropping ground-truth components in CARLA and Cityscapes and evaluate on additional datasets (Pascal VOC, COCO-Stuff, ADE20K), reporting precision, recall, F1, and AP under various perturbations. Across datasets, the method achieves substantial precision (roughly 40–70%) while maintaining reasonable recall, significantly outperforming baselines and enabling efficient data-cleaning workflows. The findings demonstrate the practical utility of calibrated component-level uncertainty for improving dataset quality and benchmark reliability in semantic segmentation, with code made publicly available for replication and extension.
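The flagging rule summarized above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that a predicted component counts as a false positive when it has no overlap with ground truth of the same class (a simplification that stands in for the adjusted IoU), and it takes the calibrated meta classifier as a hypothetical callable `p_tp` returning an estimate of $P(\mathrm{TP}_o)$. The function names, the 4-connectivity, and the threshold of 0.5 are likewise assumptions made for illustration.

```python
from collections import deque


def connected_components(label_map):
    """Return a list of (class_id, pixel_set) for the 4-connected components."""
    h, w = len(label_map), len(label_map[0])
    seen = [[False] * w for _ in range(h)]
    components = []
    for r in range(h):
        for c in range(w):
            if seen[r][c]:
                continue
            cls = label_map[r][c]
            pixels = set()
            queue = deque([(r, c)])
            seen[r][c] = True
            # Breadth-first flood fill over pixels sharing the same class.
            while queue:
                y, x = queue.popleft()
                pixels.add((y, x))
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if (0 <= ny < h and 0 <= nx < w
                            and not seen[ny][nx]
                            and label_map[ny][nx] == cls):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            components.append((cls, pixels))
    return components


def flag_label_error_candidates(pred, gt, p_tp, threshold=0.5):
    """Flag predicted components that the ground truth does not contain
    but that the (hypothetical) meta classifier p_tp considers likely correct."""
    candidates = []
    for cls, pixels in connected_components(pred):
        # Assumption: "false positive" means zero overlap with same-class ground truth.
        overlap = sum(1 for (y, x) in pixels if gt[y][x] == cls)
        if overlap == 0:
            score = p_tp(cls, pixels)
            if score >= threshold:
                candidates.append((cls, pixels, score))
    return candidates


# Toy example: the prediction contains a class-1 object missing from the labels.
pred = [[0, 0, 0, 0],
        [0, 1, 1, 0],
        [0, 1, 1, 0],
        [0, 0, 0, 2]]
gt = [[0, 0, 0, 0],
      [0, 0, 0, 0],
      [0, 0, 0, 0],
      [0, 0, 0, 2]]


def toy_p_tp(cls, pixels):
    # Stand-in for a calibrated meta classifier: trusts larger components more.
    return 0.9 if len(pixels) >= 4 else 0.1


candidates = flag_label_error_candidates(pred, gt, toy_p_tp)
```

In this toy example only the class-1 component is flagged: the background and the class-2 pixel both overlap ground truth of their own class, so they are not false positives. In practice the candidates would be sorted by score and passed to a human reviewer.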
Abstract
In this work, we present, for the first time, a method for detecting label errors in image datasets with semantic segmentation, i.e., pixel-wise class labels. Annotation acquisition for semantic segmentation datasets is time-consuming and requires plenty of human labor. In particular, review processes are time-consuming and label errors can easily be overlooked by humans. The consequences are biased benchmarks and, in extreme cases, performance degradation of deep neural networks (DNNs) trained on such datasets. DNNs for semantic segmentation yield pixel-wise predictions, which makes detection of label errors via uncertainty quantification a complex task. Uncertainty is particularly pronounced at the transitions between connected components of the prediction. By lifting the consideration of uncertainty to the level of predicted components, we enable the usage of DNNs together with component-level uncertainty quantification for the detection of label errors. We present a principled approach to benchmarking the task of label error detection by dropping labels from the Cityscapes dataset as well as from a dataset extracted from the CARLA driving simulator, where, in the latter case, we have full control over the labels. Our experiments show that our approach is able to detect the vast majority of label errors while controlling the number of false label error detections. Furthermore, we apply our method to semantic segmentation datasets frequently used by the computer vision community and present a collection of label errors along with sample statistics.
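The benchmarking idea of dropping labels and then scoring the detections can be sketched as follows. This is a minimal illustration under stated assumptions, not the protocol used in the paper: the helper names `drop_components` and `precision_recall_f1` are hypothetical, dropped components are overwritten with a caller-chosen `fill_class`, and a detection counts as correct if it shares at least one pixel with a dropped component.

```python
import random


def drop_components(gt, components, fill_class, drop_fraction, seed=0):
    """Return a perturbed copy of gt and the ids of the dropped components.

    components: dict mapping component id -> set of (row, col) pixels.
    fill_class: class written over dropped pixels (an assumption of this sketch).
    """
    rng = random.Random(seed)
    ids = sorted(components)
    n_drop = round(drop_fraction * len(ids))
    dropped = set(rng.sample(ids, n_drop))
    perturbed = [row[:] for row in gt]
    for cid in dropped:
        for (y, x) in components[cid]:
            perturbed[y][x] = fill_class
    return perturbed, dropped


def precision_recall_f1(detections, components, dropped):
    """Score detections (pixel sets) against the dropped components.

    Assumed matching rule: any pixel overlap with a dropped component.
    """
    found = set()
    correct = 0
    for det in detections:
        matched = {cid for cid in dropped if components[cid] & det}
        if matched:
            correct += 1
            found |= matched
    precision = correct / len(detections) if detections else 0.0
    recall = len(found) / len(dropped) if dropped else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy example: two labeled objects, both dropped; one of two detections is correct.
gt = [[1, 1, 0, 0],
      [0, 0, 0, 0],
      [0, 0, 2, 2]]
components = {1: {(0, 0), (0, 1)}, 2: {(2, 2), (2, 3)}}
perturbed, dropped = drop_components(gt, components, fill_class=0, drop_fraction=1.0)
detections = [{(0, 0)}, {(1, 1)}]
precision, recall, f1 = precision_recall_f1(detections, components, dropped)
```

Because the dropped components are known exactly, precision and recall of a label error detector can be measured without manual review, which is what makes the synthetic perturbation useful as a benchmark.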
