Automated Detection of Label Errors in Semantic Segmentation Datasets via Deep Learning and Uncertainty Quantification

Matthias Rottmann; Marco Reese

TL;DR

This work tackles the problem of label errors in semantic segmentation datasets by elevating uncertainty quantification to the level of connected components. The authors introduce a principled method that, given a trained segmentation model, computes component-wise probabilities of correctness using an adjusted IoU and a calibrated meta classifier to produce $P(\mathrm{TP}_o)$ for each predicted component; false-positive components with high $P(\mathrm{TP}_o)$ are flagged as potential label errors. They establish a benchmark by synthetically dropping ground-truth components in CARLA and Cityscapes and evaluate on additional datasets (Pascal VOC, COCO-Stuff, ADE20K), reporting precision, recall, F1, and AP under various perturbations. Across datasets, the method achieves substantial precision (roughly 40–70%) while maintaining reasonable recall, significantly outperforming baselines and enabling efficient data-cleaning workflows. The findings demonstrate the practical utility of calibrated component-level uncertainty for improving dataset quality and benchmark reliability in semantic segmentation, with code made publicly available for replication and extension.
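
The flagging rule described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that masks are plain nested lists of class ids, that a "false-positive component" is a predicted connected component with no overlap with ground truth of the same class (the paper's adjusted IoU is not reproduced here), and that `p_tp` is a hypothetical callable standing in for the calibrated meta classifier. The names `connected_components`, `flag_label_errors`, and `threshold` are illustrative only.

```python
from collections import deque


def connected_components(mask):
    """Return 4-connected components of a class-id mask as (class_id, pixel_set) pairs."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    comps = []
    for r in range(h):
        for c in range(w):
            if seen[r][c]:
                continue
            cls = mask[r][c]
            pixels = set()
            queue = deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                pixels.add((y, x))
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if (0 <= ny < h and 0 <= nx < w
                            and not seen[ny][nx] and mask[ny][nx] == cls):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            comps.append((cls, pixels))
    return comps


def flag_label_errors(pred, gt, p_tp, threshold=0.5):
    """Flag predicted components that miss the ground truth but look correct.

    p_tp is a hypothetical stand-in for the meta classifier: it maps
    (class_id, pixel_set) to an estimated probability P(TP_o).
    """
    flagged = []
    for cls, pixels in connected_components(pred):
        # Assumption: "false positive" means no same-class ground-truth pixel is hit.
        overlaps_gt = any(gt[y][x] == cls for y, x in pixels)
        if not overlaps_gt and p_tp(cls, pixels) >= threshold:
            flagged.append((cls, pixels))
    return flagged


# Toy example: the model predicts a class-1 object that the annotation lacks.
pred = [[0, 0, 1, 1],
        [0, 0, 1, 1],
        [0, 0, 0, 0]]
gt = [[0, 0, 0, 0],
      [0, 0, 0, 0],
      [0, 0, 0, 0]]
toy_p_tp = lambda cls, pixels: 0.9 if len(pixels) >= 4 else 0.1  # made-up scores
candidates = flag_label_errors(pred, gt, toy_p_tp)
```

In this toy case the only candidate is the class-1 component of four pixels, which a reviewer would then inspect by hand.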

Abstract

In this work, we present, for the first time, a method for detecting label errors in image datasets for semantic segmentation, i.e., datasets with pixel-wise class labels. Annotation acquisition for semantic segmentation datasets is time-consuming and requires plenty of human labor. In particular, review processes are time-consuming, and label errors can easily be overlooked by humans. The consequences are biased benchmarks and, in extreme cases, performance degradation of deep neural networks (DNNs) trained on such datasets. DNNs for semantic segmentation yield pixel-wise predictions, which makes the detection of label errors via uncertainty quantification a complex task. Uncertainty is particularly pronounced at the transitions between connected components of the prediction. By lifting the consideration of uncertainty to the level of predicted components, we enable the use of DNNs together with component-level uncertainty quantification for the detection of label errors. We present a principled approach to benchmarking the task of label error detection by dropping labels from the Cityscapes dataset as well as from a dataset extracted from the CARLA driving simulator, where in the latter case we have the labels under control. Our experiments show that our approach is able to detect the vast majority of label errors while controlling the number of false label error detections. Furthermore, we apply our method to semantic segmentation datasets frequently used by the computer vision community and present a collection of label errors along with sample statistics.
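
The benchmarking idea of dropping labels and then scoring detections can be sketched as follows. This is an illustration under stated assumptions rather than the paper's protocol: dropped components are overwritten with a caller-supplied `fill_class` (how the authors fill dropped regions is not stated here), and flagged detections are assumed to be already matched to dropped components through shared integer ids. The function names `drop_components` and `detection_scores` are hypothetical.

```python
def drop_components(gt, components, drop_ids, fill_class):
    """Return a copy of gt in which the chosen components are overwritten.

    components maps an integer id to a set of (row, col) pixels.
    fill_class is an assumption of this sketch, not taken from the paper.
    """
    perturbed = [row[:] for row in gt]
    for comp_id in drop_ids:
        for y, x in components[comp_id]:
            perturbed[y][x] = fill_class
    return perturbed


def detection_scores(flagged_ids, dropped_ids):
    """Precision, recall and F1 of flagged components against induced label errors."""
    flagged, dropped = set(flagged_ids), set(dropped_ids)
    true_pos = len(flagged & dropped)
    precision = true_pos / len(flagged) if flagged else 0.0
    recall = true_pos / len(dropped) if dropped else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom > 0 else 0.0
    return precision, recall, f1


# Toy example: drop component 1 from the annotation, then score two detections.
gt = [[5, 5, 0],
      [0, 0, 7]]
components = {0: {(0, 0), (0, 1)}, 1: {(1, 2)}}
perturbed = drop_components(gt, components, drop_ids=[1], fill_class=0)
precision, recall, f1 = detection_scores(flagged_ids=[1, 2], dropped_ids=[1])
```

Because the induced errors are known exactly, every flagged component can be counted as a hit or a miss without manual review, which is what makes such a benchmark reproducible.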

Paper Structure (25 sections, 3 equations, 10 figures, 9 tables)

Introduction
Related Work
Label Error Detection Method
Estimating the probability of a prediction being correct.
Detection of label errors.
Datasets and Benchmarks for Label Error Detection in Semantic Segmentation
Definition of label error.
Benchmark definition.
Datasets.
Evaluation protocol.
Numerical Experiments
Experiment setup.
Experiments with Induced Label Errors
Label Errors in Frequently used Datasets
Conclusion & Outlook
...and 10 more sections

Figures (10)

Figure 1: A visualization of our method for label error detection.
Figure 2: Two examples of label errors detected by our method, one example per row. Left: prediction of our label error detection method; right: "ground truth" annotation. Our method is able to find both overlooked and flipped labels.
Figure 3: Two detected label errors in PascalVOC. The visualization scheme follows that of Figure 2.
Figure 4: Two annotations, before and after we applied smoothing. Note that we do not smooth every class. Smoothing is applied to the classes of pedestrians, poles, vegetation and vehicles. Additionally, we remove road lines from the annotation.
Figure 5: The connected component shown here is originally labeled as void. Our method predicts a label error here, which is likely correct. However, since we cannot confirm this beyond doubt, we counted it as a false positive.
...and 5 more figures
