Improving Uncertainty-Error Correspondence in Deep Bayesian Medical Image Segmentation

Prerak Mody, Nicolas F. Chaves-de-Plaza, Chinmay Rao, Eleftheria Astrenidou, Mischa de Ridder, Nienke Hoekstra, Klaus Hildebrandt, Marius Staring

TL;DR

This work tackles the gap between Bayesian uncertainty and practical error detection in deep medical image segmentation by introducing an Accuracy-vs-Uncertainty (AvU) loss. A FlipOut-based UNet is trained to produce predictive uncertainty that is concentrated in inaccurate voxels, improving the uncertainty-utility for semi-automated QA. The approach is validated on head-and-neck CT and prostate MR datasets, showing improved uncertainty-error correspondence (ROC/PRC) while maintaining or improving segmentation performance and calibration. The AvU loss is lightweight to integrate and demonstrates potential to enhance clinical QA workflows by reducing time spent on evaluating accurate regions.

Abstract

Increased usage of automated tools like deep learning in medical image segmentation has alleviated the bottleneck of manual contouring. This has shifted manual labour to quality assessment (QA) of automated contours, which involves detecting errors and correcting them. A potential solution to semi-automated QA is to use deep Bayesian uncertainty to recommend potentially erroneous regions, thus reducing time spent on error detection. Previous work has investigated the correspondence between uncertainty and error; however, no work has been done on improving the "utility" of Bayesian uncertainty maps such that uncertainty is present only in inaccurate regions and not in accurate ones. Our work trains the FlipOut model with the Accuracy-vs-Uncertainty (AvU) loss, which promotes uncertainty to be present only in inaccurate regions. We apply this method to datasets of two radiotherapy body sites, namely head-and-neck CT and prostate MR scans. Uncertainty heatmaps (i.e. predictive entropy) are evaluated against voxel inaccuracies using Receiver Operating Characteristic (ROC) and Precision-Recall (PR) curves. Numerical results show that, compared to the Bayesian baseline, the proposed method successfully suppresses uncertainty for accurate voxels while retaining a similar presence of uncertainty for inaccurate voxels. Code to reproduce experiments is available at https://github.com/prerakmody/bayesuncertainty-error-correspondence
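
To make the evaluation inputs concrete, the following minimal sketch shows one common way to obtain a predictive-entropy heatmap and a voxel inaccuracy map from several stochastic forward passes of a Bayesian segmentation model. It is not taken from the authors' repository: the function names `predictive_entropy` and `inaccuracy_map`, the array layout `(T, voxels, classes)`, and the toy values are all assumptions made for illustration.

```python
import numpy as np

def predictive_entropy(mc_probs, eps=1e-12):
    """Entropy of the mean softmax over T stochastic forward passes.

    mc_probs: array of shape (T, ..., C) holding per-pass class probabilities
    (assumed layout, not the authors' data format).
    """
    mean_probs = mc_probs.mean(axis=0)  # average over the T passes
    entropy = -np.sum(mean_probs * np.log(mean_probs + eps), axis=-1)
    return mean_probs, entropy

def inaccuracy_map(mean_probs, gt_labels):
    """Boolean map that is True wherever the predicted class differs from GT."""
    return mean_probs.argmax(axis=-1) != gt_labels

# Toy example: T=2 passes, 3 voxels, 2 classes (hypothetical values).
mc_probs = np.array([
    [[0.9, 0.1], [0.7, 0.3], [0.2, 0.8]],
    [[0.9, 0.1], [0.4, 0.6], [0.4, 0.6]],
])
gt_labels = np.array([0, 1, 1])

mean_probs, entropy = predictive_entropy(mc_probs)
inaccurate = inaccuracy_map(mean_probs, gt_labels)
```

In this toy case the second voxel, where the two passes disagree, receives the highest entropy and is also the only misclassified voxel, which is the kind of uncertainty-error correspondence that the ROC and PR curves are meant to quantify.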

Paper Structure (33 sections, 8 equations, 5 figures, 10 tables)

Figures (5)

  • Figure 1: Method overview - A 3D medical scan (e.g. CT/MR) is input into a UNet-based Bayesian neural net to produce both predicted contours (Pred) and predictive uncertainty (Unc). While the cross-entropy loss is used to improve segmentation performance, the Accuracy-vs-Uncertainty (AvU) loss is used to improve uncertainty-error correspondence. The AvU loss is computed by comparing the prediction with the ground truth (GT) at a specific uncertainty threshold using four terms: count of accurate-and-certain ($n_\mathrm{AC}$), accurate-and-uncertain ($n_\mathrm{AU}$), inaccurate-and-certain ($n_\mathrm{IC}$) and inaccurate-and-uncertain ($n_\mathrm{IU}$) voxels.
  • Figure 2: Uncertainty-error correspondence for the head-and-neck (H&N) CT (a,b) dataset. Slices of the CT scans are shown in pairs to understand the 3D nature of segmentation uncertainty heatmaps. The color bar on the right depicts the range of uncertainty values while green and blue are used for ground truth and prediction contours respectively.
  • Figure 3: Uncertainty-error correspondence for the Prostate MR (a,b) dataset. Slices of the MR scans are shown in pairs to understand the 3D nature of segmentation uncertainty heatmaps. The color bar on the right depicts the range of uncertainty values while green and blue are used for ground truth and prediction contours respectively.
  • Figure 4: The figures above show the distribution of the uncertainty-error correspondence metrics as curves and boxplots (with swarm plots) for patients from the RTOG clinical trial (a-f) as well as for the Medical Decathlon (Prostate) dataset (g-l). We only evaluate up to the maximum uncertainty of each dataset as the metrics do not change beyond that.
  • Figure 5: The green and blue contours in a) show the ground truth (GT) and predicted contours. In b) we see the inaccuracy map in black, while c) and d) show the smaller segmentation “errors” and larger segmentation “failures” respectively.
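
The four voxel counts named in the Figure 1 caption can be illustrated with a short sketch. It computes the hard (non-differentiable) counts at a single uncertainty threshold and combines them into the ratio $(n_\mathrm{AC} + n_\mathrm{IU}) / (n_\mathrm{AC} + n_\mathrm{AU} + n_\mathrm{IC} + n_\mathrm{IU})$, which is the usual definition of the AvU metric and is assumed here rather than quoted from the paper. The differentiable loss used during training is not reproduced, and the names `avu_counts` and `avu_score`, the threshold, and the toy arrays are hypothetical.

```python
import numpy as np

def avu_counts(uncertainty, inaccurate, threshold):
    """Count voxels in the four accuracy/uncertainty categories.

    uncertainty: float array of per-voxel uncertainty (e.g. predictive entropy)
    inaccurate:  bool array, True where prediction and ground truth disagree
    threshold:   voxels with uncertainty above this value count as uncertain
    """
    uncertain = uncertainty > threshold
    accurate = ~inaccurate
    n_ac = int(np.sum(accurate & ~uncertain))    # accurate and certain
    n_au = int(np.sum(accurate & uncertain))     # accurate and uncertain
    n_ic = int(np.sum(inaccurate & ~uncertain))  # inaccurate and certain
    n_iu = int(np.sum(inaccurate & uncertain))   # inaccurate and uncertain
    return n_ac, n_au, n_ic, n_iu

def avu_score(n_ac, n_au, n_ic, n_iu):
    """Fraction of voxels whose certainty agrees with their accuracy."""
    total = n_ac + n_au + n_ic + n_iu
    return (n_ac + n_iu) / total if total else 0.0

# Toy example with six voxels (hypothetical values).
unc = np.array([0.05, 0.60, 0.10, 0.70, 0.02, 0.40])
err = np.array([False, False, True, True, False, True])

counts = avu_counts(unc, err, threshold=0.3)
score = avu_score(*counts)
```

A score of 1 would mean every accurate voxel is certain and every inaccurate voxel is uncertain. Repeating the count over a range of thresholds gives the per-threshold quantities from which curves like those in Figure 4 could be built.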