Better than Average: Spatially-Aware Aggregation of Segmentation Uncertainty Improves Downstream Performance

Vanessa Emanuela Guarino, Claudia Winklmayr, Jannik Franzen, Josef Lorenz Rumberger, Manuel Pfeuffer, Sonja Greven, Klaus Maier-Hein, Carsten T. Lüth, Christoph Karg, Dagmar Kainmueller

Abstract

Uncertainty Quantification (UQ) is crucial for ensuring the reliability of automated image segmentations in safety-critical domains like biomedical image analysis or autonomous driving. In segmentation, UQ generates pixel-wise uncertainty scores that must be aggregated into image-level scores for downstream tasks like Out-of-Distribution (OoD) or failure detection. Despite routine use of aggregation strategies, their properties and impact on downstream task performance have not yet been comprehensively studied. Global Average is the default choice, yet it does not account for spatial and structural features of segmentation uncertainty. Alternatives like patch-, class-, and threshold-based strategies exist, but they have not been systematically compared, leading to inconsistent reporting and unclear best practices. We address this gap by (1) formally analyzing properties, limitations, and pitfalls of common strategies; (2) proposing novel strategies that incorporate spatial uncertainty structure; and (3) benchmarking their performance on OoD and failure detection across ten datasets that vary in image geometry and structure. We find that aggregators leveraging spatial structure yield stronger performance in both downstream tasks studied. However, the performance of individual aggregators depends heavily on dataset characteristics, so we (4) propose a meta-aggregator that integrates multiple aggregators and performs robustly across datasets.
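To illustrate why Global Average can miss spatial structure, the following is a minimal sketch, not the authors' implementation. It compares a global average with a simple sliding-window aggregator on two toy uncertainty maps that carry the same total uncertainty. The function names (`global_average`, `max_patch_average`, `make_map`), the toy maps, and the choice of "maximum patch mean" as the patch-based variant are assumptions made for illustration only.

```python
from typing import List, Tuple

UncertaintyMap = List[List[float]]


def global_average(umap: UncertaintyMap) -> float:
    """Mean of all pixel-wise uncertainty values (ignores where they are)."""
    values = [v for row in umap for v in row]
    return sum(values) / len(values)


def max_patch_average(umap: UncertaintyMap, patch: int) -> float:
    """Highest mean uncertainty over all patch x patch sliding windows.

    Hypothetical patch-based aggregator used only for this illustration.
    """
    height, width = len(umap), len(umap[0])
    best = float("-inf")
    for top in range(height - patch + 1):
        for left in range(width - patch + 1):
            window = [
                umap[r][c]
                for r in range(top, top + patch)
                for c in range(left, left + patch)
            ]
            best = max(best, sum(window) / len(window))
    return best


def make_map(size: int, hot_pixels: List[Tuple[int, int]]) -> UncertaintyMap:
    """Build a size x size map that is 0 everywhere except the given pixels."""
    umap = [[0.0] * size for _ in range(size)]
    for r, c in hot_pixels:
        umap[r][c] = 1.0
    return umap


# Same uncertainty mass (four pixels), different spatial layout.
clustered = make_map(4, [(0, 0), (0, 1), (1, 0), (1, 1)])
scattered = make_map(4, [(0, 0), (0, 3), (3, 0), (3, 3)])

print(global_average(clustered), global_average(scattered))
print(max_patch_average(clustered, 2), max_patch_average(scattered, 2))
```

In this toy setting the global average assigns both maps the same score, while the patch-based score separates the clustered map from the scattered one. Which behavior is preferable depends on the downstream task and dataset, which is the question the benchmark addresses.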

Paper Structure

This paper contains 74 sections, 16 equations, 13 figures, 6 tables.

Figures (13)

  • Figure 1: Aggregation strategies (AggS). (a) An AggS reduces an uncertainty map to a single scalar score. (b) Empirical evaluation of AggSs. Choosing an appropriate AggS is challenging and highly task-dependent. Hence, we benchmark different strategies in the context of OoD and failure detection over different datasets. (c) Limitations of individual AggSs. Pixel-wise strategies, in particular, have key shortcomings, e.g., (1) AVG ignores spatial structure; (2) AQA lacks proportional invariance; and (3) ATA is not monotonic (see Supp. \ref{app:formal_properties} for more details). (d) Subtypes of AggSs. Beyond pixel-wise AggSs, we also explore spatially-aware approaches, such as prediction-based and spatial AggSs, which measure the fraction of uncertainty mass within structured regions of the uncertainty map. Finally, we consider Meta-AggSs, which combine intensity-based and spatial strategies by fitting a Gaussian Mixture Model (GMM).
  • Figure 2: Structural diversity of datasets. The scatterplot illustrates structural diversity by projecting uncertainty maps into the space of two SMR scores: MOR and EDS. High MOR indicates clustered uncertainty (low: noise), while high EDS reflects edge-localized uncertainty (low: flat regions). Notably, EDS effectively identifies OoD samples in CAR-ID/CAR-CS.
  • Figure 3: Performance on OoD Detection. Higher AUROC (0.6–1.0) indicates better iD–OoD separation. AggSs are ranked by their mean AUROC, computed over 500 bootstrap samples per dataset and then averaged across datasets for stability. Ranking robustness is assessed via one-sided Wilcoxon signed-rank tests at the 5% significance level, showing that prediction-based BCA and ICA, along with GMM-based methods, form a statistically dominant tier ($p < 0.05$). Numbers after AggS labels indicate method-specific parameters (cf. Supp. \ref{app:aggs-params}). Full analysis, including additional UQ methods and confidence intervals, is in Supp. \ref{app:mean_rank_tables}.
  • Figure 4: GMM Robustness for OoD Detection. (a) Fitting a GMM enables spatial methods to act as AggSs, capturing subtle non-linear changes in uncertainty maps. (b) Qualitative examples showing that effective AggSs for iD–OoD separation are data-dependent. (c) GMM-All AUROC: leave-one-out (top) and individual fitting (bottom) on a subset (CAR-CS, LIZ-IG, WEED-Hand, and WORM-Nem split). Using all AggSs generally matches or outperforms individual ones in 3 cases, with minimal impact from removing specific ones. Exceptions occur when a feature dominates (e.g., EDS for CAR-CS, marked as Tukey outlier) or when features lack discriminative power. (d) GMM-All AUROC: absolute SHAP values. Positive values indicate that an AggS aids GMM-All in separating iD–OoD; bi-directional or null contributions reduce performance (e.g., in LIZ-IG). SHAP values are shown for the same subsets as in (c).
  • Figure 5: Performance on failure detection (FD). (a) Exemplary Selective Risk-Coverage curves. (b) E-AURC scores. Lower values indicate better alignment between uncertainty and prediction errors. AggSs are ranked by their mean E-AURC, computed over 500 bootstrap samples per dataset and then averaged across datasets for stability. Ranking robustness is assessed via one-sided Wilcoxon signed-rank tests at the 5% significance level, indicating that prediction-based QFR achieves a statistically significant improvement over all other AggSs ($p < 0.001$), followed by BCA and GMM-based methods ($p < 0.05$). Full analysis, including additional UQ methods and confidence intervals, is in Supp. \ref{app:mean_rank_tables}.
  • ...and 8 more figures
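
The Figure 3 caption describes ranking aggregators by mean AUROC over 500 bootstrap samples per dataset. A minimal sketch of that kind of computation is given below, assuming that image-level scores for iD and OoD images are resampled independently with replacement. The resampling scheme, the function names (`auroc`, `bootstrap_mean_auroc`), and the example scores are assumptions for illustration and are not taken from the paper.

```python
import random
from typing import List, Sequence


def auroc(scores_id: Sequence[float], scores_ood: Sequence[float]) -> float:
    """Probability that a random OoD score exceeds a random iD score.

    Ties count as 0.5, which matches the Mann-Whitney formulation of AUROC.
    """
    wins = 0.0
    for s_ood in scores_ood:
        for s_id in scores_id:
            if s_ood > s_id:
                wins += 1.0
            elif s_ood == s_id:
                wins += 0.5
    return wins / (len(scores_id) * len(scores_ood))


def bootstrap_mean_auroc(
    scores_id: List[float],
    scores_ood: List[float],
    n_boot: int = 500,
    seed: int = 0,
) -> float:
    """Mean AUROC over n_boot resamples drawn with replacement.

    Assumption: iD and OoD scores are resampled separately so that each
    bootstrap sample keeps the original group sizes.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_boot):
        sample_id = rng.choices(scores_id, k=len(scores_id))
        sample_ood = rng.choices(scores_ood, k=len(scores_ood))
        total += auroc(sample_id, sample_ood)
    return total / n_boot


# Hypothetical image-level scores produced by one aggregation strategy.
id_scores = [0.10, 0.15, 0.20, 0.40]
ood_scores = [0.30, 0.55, 0.70, 0.90]
print(bootstrap_mean_auroc(id_scores, ood_scores))
```

Repeating this per aggregator and per dataset, then averaging across datasets, would give the kind of ranking the caption describes; the significance testing step is not covered by this sketch.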