Structural-Based Uncertainty in Deep Learning Across Anatomical Scales: Analysis in White Matter Lesion Segmentation

Nataliia Molchanova, Vatsal Raina, Andrey Malinin, Francesco La Rosa, Adrien Depeursinge, Mark Gales, Cristina Granziera, Henning Muller, Mara Graziani, Meritxell Bach Cuadra

TL;DR

This paper explores uncertainty quantification (UQ) as an indicator of the trustworthiness of automated deep-learning tools in the context of white matter lesion segmentation from magnetic resonance imaging (MRI) scans of multiple sclerosis (MS) patients.

Abstract

This paper explores uncertainty quantification (UQ) as an indicator of the trustworthiness of automated deep-learning (DL) tools in the context of white matter lesion (WML) segmentation from magnetic resonance imaging (MRI) scans of multiple sclerosis (MS) patients. Our study focuses on two principal aspects of uncertainty in structured output segmentation tasks. First, we postulate that a reliable uncertainty measure should indicate predictions likely to be incorrect with high uncertainty values. Second, we investigate the merit of quantifying uncertainty at different anatomical scales (voxel, lesion, or patient). We hypothesize that uncertainty at each scale is related to specific types of errors. Our study aims to confirm this relationship by conducting separate analyses for in-domain and out-of-domain settings. Our primary methodological contributions are (i) the development of novel measures for quantifying uncertainty at lesion and patient scales, derived from structural prediction discrepancies, and (ii) the extension of an error retention curve analysis framework to facilitate the evaluation of UQ performance at both lesion and patient scales. The results from a multi-centric MRI dataset of 444 patients demonstrate that our proposed measures more effectively capture model errors at the lesion and patient scales compared to measures that average voxel-scale uncertainty values. We provide the UQ protocols code at https://github.com/Medical-Image-Analysis-Laboratory/MS_WML_uncs.

Paper Structure (36 sections, 8 equations, 9 figures, 6 tables)

Figures (9)

  • Figure 1: Illustration of the domain shift between the in-domain datasets (Train, Val, Test$_{in}$) and the out-of-domain datasets (Test$_{out}$, Test$_{private}$, and Test$_{WMH}$) caused by differences in MS stages and medical centers. On the left, the plot of total lesion volume in milliliters versus the number of lesions per scan for the in-domain (orange) and out-of-domain (gray and black) sets reveals the difference in lesion load (as a proxy for MS stage) between domains. On the right, typical examples from the Test$_{in}$ and Test$_{out}$ sets illustrate the difference in lesion load, as well as the intensity differences caused by the change of medical center (i.e., scanner, technicians, annotators, and other parameters contributing to the domain shift) and of MS stage (i.e., smaller lesion load and size).
  • Figure 2: An illustration of a Dice score retention curve (DSC-RC) for assessing the correspondence between voxel uncertainty (MEASURE$_1$ and MEASURE$_2$) and segmentation quality measured by DSC. DSC$_0$ is the quality of the predicted segmentation before voxel replacement. The IDEAL and RANDOM RCs are built for ideal and random uncertainty, respectively, and serve as the upper and lower bounds of the uncertainty-robustness performance.
  • Figure 3: Error retention curves for the assessment of uncertainty measures at the voxel, lesion, and patient anatomical scales across the in-domain Test$_{in}$ (left column) and the out-of-domain Test$_{out}$ (center column) and Test$_{private}$ (right column) sets for the nnU-Net model. Different rows correspond to different anatomical scales, indicated with icons on the left. The voxel-scale DSC-RCs and lesion-scale LPPV-RCs were obtained by averaging across the respective datasets. At each scale, the ideal (black dashed) line indicates the upper bound on an uncertainty measure's ability to capture model errors; the random (gray dashed) line indicates no relationship between an uncertainty measure and error; a worse-than-random performance indicates an inverse relationship. Analogous results for the SB model are shown in \ref{appendix:erc-sb}.
  • Figure 4: The relationship between DSC and patient-scale uncertainty is assessed for Test$_{in}$ (orange), Test$_{out}$ (gray), Test$_{private}$ (light gray), and Test$_{WMH}$ (black) separately and jointly for the nnU-Net model. The presented uncertainty measures were chosen based on the results of the error RC analysis (Figure \ref{fig:erc} and Table \ref{tab:erc}) to illustrate the relationship between DSC and uncertainty for the measures with the highest (proposed $PSU^{(+)}$), median (proposed $\overline{LSU^{(+)}}$), and worse-than-random ($\overline{NC}_B$ and $\overline{EoE}_B$) DSC-AUC values. Results for other measures and for the SB model can be found in \ref{appendices:corr-pat}.
  • Figure 5: Examples of uncertainty maps at the voxel and lesion scales and patient uncertainty values. The two left columns show axial slices of a FLAIR scan with the ground truth (in yellow) and predicted (in pink) WML masks; the third column shows voxel-scale uncertainty maps computed with the ${EoE}_i$ measure; the fourth column shows lesion-scale uncertainty maps computed with the proposed ${LSU}^+$; the fifth column gives the patient-scale uncertainty value computed with the proposed ${PSU}^+$. The choice of measures is based on the results of the error retention curve analysis. (A), (B), (C), and (D) represent different scenarios with gradually decreasing DSC. Cases (A) and (B) represent good and mediocre model performance, respectively. Patient (C) has an atypical large lesion, which the algorithm fails to segment, as expected. Patient (D) was not correctly preprocessed (the skull was not removed), which led to the algorithm's low performance and high patient uncertainty.
  • ...and 4 more figures
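
The voxel-scale retention curve described in the Figure 2 caption can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation (their protocols are in the repository linked in the abstract). It assumes that "voxel replacement" means replacing the predictions of the most uncertain voxels with the ground truth, and that masks are flat binary lists. The function names `dice`, `dsc_retention_curve`, and `area_under_curve` are hypothetical.

```python
from typing import List, Sequence


def dice(pred: Sequence[int], truth: Sequence[int]) -> float:
    """Dice similarity coefficient between two flat binary masks."""
    intersection = sum(p * t for p, t in zip(pred, truth))
    total = sum(pred) + sum(truth)
    # Assumed convention: two empty masks agree perfectly.
    return 1.0 if total == 0 else 2.0 * intersection / total


def dsc_retention_curve(
    pred: Sequence[int],
    truth: Sequence[int],
    uncertainty: Sequence[float],
    n_steps: int = 10,
) -> List[float]:
    """DSC after replacing a growing share of the most uncertain voxels
    with ground truth (assumed reading of 'voxel replacement')."""
    n = len(pred)
    # Voxel indices ordered from most to least uncertain.
    order = sorted(range(n), key=lambda i: uncertainty[i], reverse=True)
    corrected = list(pred)
    curve = [dice(corrected, truth)]  # DSC_0: nothing replaced yet
    replaced = 0
    for step in range(1, n_steps + 1):
        target = round(step * n / n_steps)
        for i in order[replaced:target]:
            corrected[i] = truth[i]
        replaced = target
        curve.append(dice(corrected, truth))
    return curve


def area_under_curve(curve: Sequence[float]) -> float:
    """Trapezoidal area under the curve over a unit-length x-axis."""
    n_steps = len(curve) - 1
    return sum((a + b) / 2 for a, b in zip(curve, curve[1:])) / n_steps


# Toy example: two wrong voxels (indices 2 and 3).
truth = [1, 1, 1, 0, 0, 0, 0, 0]
pred = [1, 1, 0, 1, 0, 0, 0, 0]
ideal_unc = [float(p != t) for p, t in zip(pred, truth)]  # high where wrong
inverse_unc = [1.0 - u for u in ideal_unc]                # high where right

ideal_curve = dsc_retention_curve(pred, truth, ideal_unc, n_steps=4)
inverse_curve = dsc_retention_curve(pred, truth, inverse_unc, n_steps=4)
print(area_under_curve(ideal_curve), area_under_curve(inverse_curve))
```

An uncertainty measure that ranks erroneous voxels first lifts the curve early and yields a larger area under it, while a measure that ranks them last keeps the curve at DSC$_0$ until the final step.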