Conformal Performance Range Prediction for Segmentation Output Quality Control

Anna M. Wundram, Paul Fischer, Michael Muehlebach, Lisa M. Koch, Christian F. Baumgartner

TL;DR

This work addresses reliable estimation of segmentation output quality without ground truth by predicting performance ranges with statistical guarantees. It combines sampling-based segmentation uncertainty with split conformal prediction to produce intervals that contain the true Dice-Sørensen Coefficient (DSC) with probability at least $1-\alpha$. The evaluation on the FIVES retinal vessel dataset compares five uncertainty estimation methods; PHiSeg achieves the best balance of accurate predictions, coverage, and tight interval sizes, although low-quality images lead to larger prediction uncertainty. The results demonstrate the practical value of conformal performance prediction for output quality control. The work acknowledges limitations arising from the exchangeability assumption and out-of-distribution (OOD) settings, and suggests extensions to domain-shift scenarios as future work.

Abstract

Recent works have introduced methods to estimate segmentation performance without ground truth, relying solely on neural network softmax outputs. These techniques hold potential for intuitive output quality control. However, such performance estimates rely on calibrated softmax outputs, and modern neural networks are often not well calibrated. Moreover, the estimates do not take into account the inherent uncertainty in segmentation tasks. These limitations may render precise performance predictions unattainable, restricting the practical applicability of performance estimation methods. To address these challenges, we develop a novel approach for predicting performance ranges with statistical guarantees of containing the ground truth with a user-specified probability. Our method leverages sampling-based segmentation uncertainty estimation to derive heuristic performance ranges, and applies split conformal prediction to transform these estimates into rigorous prediction ranges that meet the desired guarantees. We demonstrate our approach on the FIVES retinal vessel segmentation dataset and compare five commonly used sampling-based uncertainty estimation techniques. Our results show that it is possible to achieve the desired coverage with small prediction ranges, highlighting the potential of performance range prediction as a valuable tool for output quality control.
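
The two steps described in the abstract (a heuristic range from segmentation samples, then split conformal calibration) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the authors' confirmed implementation: the function names (`dice`, `heuristic_range`, `conformal_offset`, `conformal_range`) are hypothetical, the heuristic range is taken as quantiles of the DSC between each sample and the thresholded mean prediction, and the nonconformity score is the one used in conformalized quantile regression.

```python
import numpy as np

def dice(a, b, eps=1e-8):
    """Dice-Sørensen coefficient between two binary masks."""
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    inter = np.logical_and(a, b).sum()
    return (2.0 * inter + eps) / (a.sum() + b.sum() + eps)

def heuristic_range(samples, alpha=0.1):
    """Heuristic DSC estimate and range for one image (illustrative choice).

    samples: array of shape (N, H, W) holding N binary segmentation samples.
    Returns (y_hat, y_lo, y_hi) with no statistical guarantee attached.
    """
    samples = np.asarray(samples)
    mean_pred = samples.mean(axis=0) >= 0.5  # thresholded mean as point prediction
    scores = np.array([dice(mean_pred, s) for s in samples])
    y_lo = np.quantile(scores, alpha / 2)
    y_hi = np.quantile(scores, 1 - alpha / 2)
    return scores.mean(), y_lo, y_hi

def conformal_offset(y_lo, y_hi, y_true, alpha=0.1):
    """Split conformal calibration on a held-out labelled set.

    Score (assumed, CQR-style): how far the true DSC falls outside the
    heuristic range; negative if it lies strictly inside.
    """
    y_lo, y_hi, y_true = map(np.asarray, (y_lo, y_hi, y_true))
    scores = np.maximum(y_lo - y_true, y_true - y_hi)
    n = scores.size
    k = int(np.ceil((n + 1) * (1 - alpha)))  # rank of the calibrated quantile
    if k > n:
        return np.inf  # too few calibration points for this alpha
    return np.sort(scores)[k - 1]

def conformal_range(y_lo, y_hi, q):
    """Widen (or shrink, if q < 0) the heuristic range by the offset q."""
    # Clipping to [0, 1] cannot remove a valid DSC value from the range.
    return np.clip(np.asarray(y_lo) - q, 0.0, 1.0), np.clip(np.asarray(y_hi) + q, 0.0, 1.0)
```

In use, `conformal_offset` would be run once on calibration images with known ground truth, and `conformal_range` applied to every new image. Under exchangeability of calibration and test data, the resulting range contains the true DSC with probability at least $1-\alpha$; the guarantee is marginal over images, not per image.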

Paper Structure (15 sections, 7 equations, 5 figures)

This paper contains 15 sections, 7 equations, 5 figures.

Figures (5)

  • Figure 1: Overview. (a) Given a fundus image, we predict a vessel segmentation, the expected Dice-Sørensen Coefficient (DSC), as well as upper and lower bounds for the expected DSC. (b) Conformal prediction allows us to set the performance range such that at most $\alpha=10\%$ of the test cases have a DSC outside the predicted interval. We show a confident case with low DSC prediction uncertainty (green), as well as a case with high DSC prediction uncertainty due to poor image quality (red).
  • Figure 2: Quantitative analysis. (a) Performance prediction absolute error, (b) marginal and conditional coverage for very small (0, 0.1], small (0.1, 0.2], large (0.2, 0.5], and very large (0.5, 1] interval sizes, and (c) interval sizes for all investigated methods.
  • Figure 3: Visualisation of performance ranges. Performance predictions $\hat{y}$ (green/red), ground truth DSC scores $y$ (black), and performance ranges $[\hat{y}_l, \hat{y}_u]$ (gray) for all images in the test set. The images are sorted by ground-truth performance.
  • Figure 4: PHiSeg segmentations and performance predictions for three examples. Top: An example with good segmentation performance. Middle: An example with poor image quality and poor segmentation performance. Bottom: An example with good image quality but poor segmentation performance due to visible pathologies stemming from advanced diabetic retinopathy.
  • Figure 5: Marginal and conditional coverage for multiple interval sizes: very small (0, 0.1], small (0.1, 0.2], large (0.2, 0.5], and very large (0.5, 1], for high quality images (left) and poor quality images (right).
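
The marginal and size-conditional coverage reported in Figures 2 and 5 can be computed along the following lines. This is a minimal sketch under stated assumptions: the function name `coverage_by_size` is hypothetical, and the bin edges assume that the four interval-size groups partition (0, 1] as (0, 0.1], (0.1, 0.2], (0.2, 0.5], and (0.5, 1].

```python
import numpy as np

# Assumed interval-size bins (left-open, right-closed), from very small to very large.
BINS = [(0.0, 0.1), (0.1, 0.2), (0.2, 0.5), (0.5, 1.0)]

def coverage_by_size(y_true, lower, upper, bins=BINS):
    """Marginal coverage plus coverage conditioned on interval size.

    y_true: true DSC per test image; lower/upper: predicted range bounds.
    Returns a dict with key "marginal" and one key per (a, b) bin.
    Empty bins are reported as NaN.
    """
    y_true, lower, upper = map(np.asarray, (y_true, lower, upper))
    covered = (y_true >= lower) & (y_true <= upper)
    size = upper - lower
    result = {"marginal": float(covered.mean())}
    for a, b in bins:
        mask = (size > a) & (size <= b)
        result[(a, b)] = float(covered[mask].mean()) if mask.any() else float("nan")
    return result
```

Split conformal prediction only guarantees the marginal value. The per-bin values are a diagnostic that shows whether coverage holds up for narrow and wide ranges alike.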