
Trusting Semantic Segmentation Networks

Samik Some, Vinay P. Namboodiri

TL;DR

This paper tackles the problem of trusting semantic segmentation outputs in deployment by analyzing how segmentation failures occur across multiple architectures and datasets and by evaluating test-time uncertainty metrics as proxies for misclassification. The authors survey uncertainty measures, propose practical, black-box metrics (notably per-pixel Entropy and Variation-based margins), and assess them via AUROC, precision, and recall across Cityscapes, ADE20K, and a domain-shift dataset (Dark Zurich). Their experiments show that simple metrics like Entropy and Probability Margin correlate with misclassifications with high recall and that dynamic thresholding can effectively flag likely errors without retraining, even under domain shift. The work provides actionable insights for building trust in segmentation systems and highlights thresholding strategies and metrics that are broadly applicable across architectures. Overall, Entropy-based signals offer a practical, low-overhead approach to assess and improve the reliability of semantic segmentation in real-world settings.
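As a rough illustration of the two signals named above, the following minimal sketch computes per-pixel entropy and a probability margin from a softmax output. It assumes the network returns class probabilities of shape (C, H, W), and it assumes the margin is the gap between the two largest class probabilities; the paper's exact definitions may differ. The function names `pixel_entropy` and `probability_margin` are hypothetical.

```python
import numpy as np

def pixel_entropy(probs, eps=1e-12):
    """Shannon entropy per pixel; probs has shape (C, H, W) and sums to 1 over axis 0."""
    return -np.sum(probs * np.log(probs + eps), axis=0)

def probability_margin(probs):
    """Gap between the largest and second-largest class probability per pixel.

    Assumed definition for illustration: a small margin means high uncertainty.
    """
    top2 = np.sort(probs, axis=0)[-2:]  # shape (2, H, W): second-largest, then largest
    return top2[1] - top2[0]

# Toy 3-class output for a 1x2 image: a confident pixel and a maximally unsure one.
probs = np.array([
    [[0.98, 1 / 3]],
    [[0.01, 1 / 3]],
    [[0.01, 1 / 3]],
])
entropy = pixel_entropy(probs)
margin = probability_margin(probs)
```

Both quantities need only the softmax output, which is why they can be treated as black-box signals: no retraining or access to model internals is required in this sketch.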

Abstract

Semantic segmentation has become an important task in computer vision with the growth of applications such as self-driving cars and medical image segmentation. Although current models provide excellent results, they are still far from perfect. While there has been significant work on improving performance, with respect to both accuracy and speed of segmentation, little work has analysed the failure cases of such systems. In this work, we aim to provide an analysis of how segmentation fails across different models and consider whether these failures can be predicted reasonably at test time. To do so, we explore existing uncertainty-based metrics and see how well they correlate with misclassifications, allowing us to define the degree of trust we put in the output of our prediction models. Through several experiments on three different models across three datasets, we show that simple measures such as entropy can be used to capture misclassification with high recall rates.
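To make "capturing misclassification with high recall" concrete, here is a minimal sketch of how flagged pixels could be scored against actual errors. It assumes a single fixed threshold on an uncertainty map, which is a simplification for illustration and not necessarily the thresholding strategy used in the paper; the function name `flag_metrics` and the toy values are hypothetical.

```python
import numpy as np

def flag_metrics(uncertainty, pred, label, threshold):
    """Precision and recall of 'uncertainty > threshold' as a detector of wrong pixels."""
    flagged = uncertainty > threshold   # pixels we would distrust
    wrong = pred != label               # pixels that are actually misclassified
    true_pos = np.sum(flagged & wrong)
    precision = true_pos / max(np.sum(flagged), 1)  # guard against empty sets
    recall = true_pos / max(np.sum(wrong), 1)
    return precision, recall

# Toy 2x2 example with made-up values.
uncertainty = np.array([[0.9, 0.1], [0.8, 0.7]])
pred = np.array([[1, 0], [2, 2]])
label = np.array([[0, 0], [2, 1]])
precision, recall = flag_metrics(uncertainty, pred, label, threshold=0.5)
```

In this toy case both misclassified pixels are flagged (recall 1.0), at the cost of one correctly classified pixel also being flagged (precision 2/3). Real datasets often contain ignore labels, which this sketch does not handle.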

Paper Structure

This paper contains 6 sections, 5 figures, 9 tables.

Figures (5)

  • Figure 1: Comparison of misclassified pixels and entropy for the DRN, OneFormer and SegFormer networks on a couple of Cityscapes validation images, shown alongside the images themselves and their edges detected using the Scharr operator. OneFormer generally produces high-entropy outputs, which make most of the image appear grey. Even so, the highest-entropy regions (in bright white) correspond well to misclassified pixels.
  • Figure 2: Cumulative histogram of entropy values for correctly classified (blue) and misclassified pixels (orange) for a couple of experimental settings.
  • Figure 3: An example of how DRN misclassifies inputs when noise is added and how well entropy can capture it. From left to right, we have the original image followed by increasing amounts of noise added to it. The bottom row shows misclassified pixels with high entropy in green, misclassified pixels with low entropy in red and correctly classified pixels with high entropy in blue.
  • Figure 4: A few images showing a couple of the best-performing classes in the ADE20K dataset. The first two images include the washing machine class, while the last two include the tent class. For each image, we show, from left to right: the image, the highlighted class, misclassified pixels, entropy, thresholded entropy, and the detection mask.
  • Figure 5: A few images showing a couple of the worst-performing classes in the ADE20K dataset. The first two images include the dirt track class, while the last two include the lake class. For each image, we show, from left to right: the image, the highlighted class, misclassified pixels, entropy, thresholded entropy, and the detection mask.
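
The colour coding described in the Figure 3 caption can be sketched as follows. This is an illustrative reconstruction, not the authors' plotting code: it assumes "high entropy" means entropy above a chosen threshold, and the function name `entropy_overlay` and the toy values are hypothetical.

```python
import numpy as np

def entropy_overlay(entropy, pred, label, threshold):
    """RGB mask: green = wrong and high entropy, red = wrong and low entropy,
    blue = correct but high entropy, black = correct and low entropy."""
    wrong = pred != label
    high = entropy > threshold
    overlay = np.zeros(entropy.shape + (3,), dtype=np.uint8)
    overlay[wrong & high] = (0, 255, 0)    # errors the entropy signal catches
    overlay[wrong & ~high] = (255, 0, 0)   # errors the entropy signal misses
    overlay[~wrong & high] = (0, 0, 255)   # false alarms
    return overlay

# Toy 2x2 example with made-up values.
entropy = np.array([[0.9, 0.1], [0.8, 0.2]])
pred = np.array([[1, 0], [2, 2]])
label = np.array([[0, 0], [2, 1]])
overlay = entropy_overlay(entropy, pred, label, threshold=0.5)
```

Read this way, green pixels are detected errors, red pixels are missed errors, and blue pixels are false alarms, so the share of green among green and red pixels corresponds to recall at that threshold.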