Rethinking Uncertainty Quantification and Entanglement in Image Segmentation

Jakob Lønborg Christensen, Vedrana Andersen Dahl, Morten Rieger Hannemose, Anders Bjorholm Dahl, Christian F. Baumgartner

Abstract

Uncertainty quantification (UQ) is crucial in safety-critical applications such as medical image segmentation. Total uncertainty is typically decomposed into data-related aleatoric uncertainty (AU) and model-related epistemic uncertainty (EU). Many methods exist for modeling AU (such as Probabilistic UNet, Diffusion) and EU (such as ensembles, MC Dropout), but it is unclear how they interact when combined. Additionally, recent work has revealed substantial entanglement between AU and EU, undermining the interpretability and practical usefulness of the decomposition. We present a comprehensive empirical study covering a broad range of AU-EU model combinations, propose a metric to quantify uncertainty entanglement, and evaluate both across downstream UQ tasks. For out-of-distribution detection, ensembles exhibit consistently lower entanglement and superior performance. For ambiguity modeling and calibration, the best models are dataset-dependent, with softmax/SSN-based methods performing well and Probabilistic UNets being less entangled. A softmax ensemble fares remarkably well on all tasks. Finally, we analyze potential sources of uncertainty entanglement and outline directions for mitigating this effect.

Paper Structure (17 sections, 2 equations, 15 figures, 3 tables)

Figures (15)

  • Figure 1: A visualization of aleatoric and epistemic modeling components and their aggregation into uncertainty measures for downstream tasks.
  • Figure 2: Example images from the LIDC-IDRI dataset, showing the 4 annotator delineations for lung nodules as red lines.
  • Figure 3: Example images from the Chaksu dataset, showing the 5 annotator delineations for cup and disc as blue and green lines respectively (for the first 5 images). Each row is data from a different scanning device.
  • Figure 4: A visualization of the entanglement measure $\Delta$, shown for an uncertainty measure where higher is better. Point 1 is disentangled, while point 2 is entangled. The metric is proportional to the signed angles (shown as $\phi_1$ and $\phi_2$) to the $U_c=U_w$ line.
  • Figure 5: Mean model predictions ($\mathbb{E}_\theta[\mathbb{E}_y[p]]$) on LIDC data (ID).
  • ...and 10 more figures