Trusting Semantic Segmentation Networks
Samik Some, Vinay P. Namboodiri
TL;DR
This paper addresses how far semantic segmentation outputs can be trusted in deployment. It analyzes how segmentation failures occur across multiple architectures and datasets, and evaluates test-time uncertainty metrics as proxies for misclassification. The authors survey uncertainty measures, focus on practical, black-box metrics (notably per-pixel Entropy and Variation-based margins), and assess them via AUROC, precision, and recall on Cityscapes, ADE20K, and a domain-shift dataset (Dark Zurich). Their experiments show that simple metrics such as Entropy and Probability Margin capture misclassifications with high recall, and that dynamic thresholding can flag likely errors without retraining, even under domain shift. The work highlights thresholding strategies and metrics that apply across architectures. Overall, Entropy-based signals offer a practical, low-overhead way to assess the reliability of semantic segmentation in real-world settings.
Abstract
Semantic segmentation has become an important task in computer vision with the growth of applications such as self-driving cars and medical image segmentation. Although current models provide excellent results, they are still far from perfect. There has been significant work on improving performance, in terms of both accuracy and speed of segmentation, but little work that analyses the failure cases of such systems. In this work, we analyse how segmentation fails across different models and ask whether these failures can be predicted reasonably well at test time. To do so, we explore existing uncertainty-based metrics and measure how well they correlate with misclassifications, which lets us define the degree of trust we place in the output of our prediction models. Through several experiments on three different models across three datasets, we show that simple measures such as entropy can capture misclassifications with high recall.
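As a rough illustration of the idea, the sketch below computes two simple per-pixel uncertainty scores from softmax outputs (Shannon entropy and the margin between the two most probable classes), flags high-entropy pixels, and measures how well the flags recover the actual misclassifications. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the toy probabilities, and the fixed threshold of 0.5 nats are all hypothetical, and a fixed threshold stands in for whatever thresholding strategy is used in the experiments.

```python
import math

def pixel_entropy(probs):
    """Shannon entropy (in nats) of one pixel's class-probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def probability_margin(probs):
    """Gap between the two largest class probabilities (small gap = uncertain)."""
    top = sorted(probs, reverse=True)
    return top[0] - top[1]

def flag_uncertain(prob_map, threshold):
    """Flag each pixel whose entropy exceeds the (assumed, fixed) threshold."""
    return [pixel_entropy(p) > threshold for p in prob_map]

def precision_recall(flags, errors):
    """Precision and recall of the flags with respect to true misclassifications."""
    true_pos = sum(1 for f, e in zip(flags, errors) if f and e)
    flagged = sum(flags)
    wrong = sum(errors)
    precision = true_pos / flagged if flagged else 0.0
    recall = true_pos / wrong if wrong else 0.0
    return precision, recall

# Toy data (hypothetical): four pixels, three classes, flattened to a list.
prob_map = [
    [0.90, 0.05, 0.05],
    [0.40, 0.35, 0.25],
    [0.34, 0.33, 0.33],
    [0.98, 0.01, 0.01],
]
labels = [0, 1, 2, 0]

# Predicted class is the argmax of each pixel's probability vector.
preds = [max(range(len(p)), key=p.__getitem__) for p in prob_map]
errors = [pred != label for pred, label in zip(preds, labels)]

flags = flag_uncertain(prob_map, threshold=0.5)
precision, recall = precision_recall(flags, errors)
```

In this toy case the two misclassified pixels are also the two with near-uniform probabilities, so the entropy flags recover both of them. On real data the flags and the errors overlap only partially, which is why precision and recall are reported separately.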
