Confidence Intervals for Performance Estimates in Brain MRI Segmentation

R. El Jurdi, G. Varoquaux, O. Colliot

TL;DR

This work analyzes how confidence intervals for segmentation performance in 3D brain MRI are shaped by test-set size and metric variability, using nnU-net on hippocampus and brain tumor tasks with Dice and Hausdorff metrics. It compares bootstrap CIs with parametric normal-approximation CIs, showing that the latter closely approximate the former even for non-Gaussian metric distributions. The authors demonstrate that segmentation requires fewer test samples than classification to achieve a given CI width, but highlight that harder tasks and metrics with larger dispersion demand larger test sets. The study provides practical guidance and tables to plan CI-aware evaluation and advocates routine reporting of CIs to improve reproducibility and interpretability in medical image segmentation.

Abstract

Medical segmentation models are evaluated empirically. As such an evaluation is based on a limited set of example images, it is unavoidably noisy. Beyond a mean performance measure, reporting confidence intervals is thus crucial. However, this is rarely done in medical image segmentation. The width of the confidence interval depends on the test set size and on the spread of the performance measure (its standard deviation across the test set). For classification, many test images are needed to avoid wide confidence intervals. Segmentation, however, has not been studied, and it differs in the amount of information brought by a given test image. In this paper, we study the typical confidence intervals in the context of segmentation in 3D brain magnetic resonance imaging (MRI). We carry out experiments using the standard nnU-net framework, two datasets from the Medical Decathlon challenge that concern brain MRI (hippocampus and brain tumor segmentation), and two performance measures: the Dice Similarity Coefficient and the Hausdorff distance. We show that the parametric confidence intervals are reasonable approximations of the bootstrap estimates for varying test set sizes and spread of the performance metric. Importantly, we show that the test size needed to achieve a given precision is often much lower than for classification tasks. Typically, a 1% wide confidence interval requires about 100-200 test samples when the spread is low (standard deviation around 3%). More difficult segmentation tasks may lead to higher spreads and require over 1000 samples.
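The relation between test set size, spread, and interval width stated in the abstract can be illustrated with a short sketch. It assumes the usual normal-approximation interval for a mean, whose full width is 2 · z · σ / √n, with σ the standard deviation of the metric across test images and z the normal quantile (about 1.96 at 95%). The function names `ci_width` and `required_test_size` are hypothetical and are not taken from the paper.

```python
import math
from statistics import NormalDist


def ci_width(std: float, n: int, confidence: float = 0.95) -> float:
    """Full width of the normal-approximation CI for a mean metric.

    std is the standard deviation of the metric across test images,
    n is the number of test images.
    """
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    return 2.0 * z * std / math.sqrt(n)


def required_test_size(std: float, width: float, confidence: float = 0.95) -> int:
    """Smallest n whose normal-approximation CI is at most `width` wide."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    return math.ceil((2.0 * z * std / width) ** 2)


if __name__ == "__main__":
    # Low-spread case from the abstract: std around 3%, target width 1%.
    print(required_test_size(std=0.03, width=0.01))  # 139
    # Hypothetical higher spread (10% is an assumed value, not from the paper).
    print(required_test_size(std=0.10, width=0.01))  # 1537
```

Under these assumptions, a standard deviation of 3% and a target width of 1% give 139 test samples, which falls in the 100-200 range quoted in the abstract. The required size grows with the square of the spread, which is why an assumed spread of 10% already calls for more than 1000 samples.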

Paper Structure

This paper contains 17 sections, 8 equations, 5 figures, and 11 tables.

Figures (5)

  • Figure 1: Histogram of the Dice Similarity Coefficient over the entire test set, shown together with a kernel density estimate (KDE), which smooths the observations with a Gaussian kernel. (a) Hippocampus dataset, 3D U-Net. (b) Hippocampus dataset, 2D U-Net. (c) Brain Tumor dataset, 3D U-Net. (d) Brain Tumor dataset, 2D U-Net.
  • Figure 2: Histogram of the 95% Hausdorff distance over the entire test set, shown together with a kernel density estimate (KDE), which smooths the observations with a Gaussian kernel. (a) Hippocampus dataset, 3D U-Net. (b) Hippocampus dataset, 2D U-Net. (c) Brain Tumor dataset, 3D U-Net. (d) Brain Tumor dataset, 2D U-Net.
  • Figure 3: Parametric and bootstrap CIs: Dice Similarity Coefficient (detailed results in Tables \ref{table:Hippo-DSC-3D}, \ref{table:Hippo-DSC-2D}, \ref{table:Brain-DSC-3D}, \ref{table:Brain-DSC-2D})
  • Figure 4: Parametric and bootstrap CIs: Hausdorff Distance (detailed results in Tables \ref{table:Hippo-HD-3D}, \ref{table:Hippo-HD-2D}, \ref{table:Brain-HD-3D}, \ref{table:Brain-HD-2D})
  • Figure 5: Histogram of the test-set size of different experiments from segmentation papers published in JMI, TMI, and MedIA in 2022. The red line represents the median test size ($n=25$).
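
The comparison shown in Figures 3 and 4 between parametric and bootstrap CIs could be reproduced along the following lines. This is a minimal sketch, not the authors' code: it assumes a percentile bootstrap of the mean (the bootstrap variant is not specified in the text above), it uses synthetic, left-skewed Dice-like scores drawn from a Beta distribution instead of real segmentation results, and the function names are hypothetical.

```python
import math
import random
from statistics import NormalDist, stdev


def parametric_ci(scores, confidence=0.95):
    """Normal-approximation CI for the mean of per-image scores."""
    n = len(scores)
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    m = sum(scores) / n
    half = z * stdev(scores) / math.sqrt(n)
    return m - half, m + half


def bootstrap_ci(scores, confidence=0.95, n_resamples=2000, seed=0):
    """Percentile bootstrap CI for the mean (assumed variant)."""
    rng = random.Random(seed)
    n = len(scores)
    # Resample the test set with replacement and record each resampled mean.
    means = sorted(
        sum(rng.choices(scores, k=n)) / n for _ in range(n_resamples)
    )
    alpha = 1.0 - confidence
    lo_idx = math.floor(alpha / 2.0 * (n_resamples - 1))
    hi_idx = math.ceil((1.0 - alpha / 2.0) * (n_resamples - 1))
    return means[lo_idx], means[hi_idx]


def synthetic_dice(n, seed=0):
    """Synthetic, non-Gaussian scores in [0, 1]; illustrative data only."""
    rng = random.Random(seed)
    return [rng.betavariate(18, 2) for _ in range(n)]


if __name__ == "__main__":
    scores = synthetic_dice(100)
    print("parametric:", parametric_ci(scores))
    print("bootstrap: ", bootstrap_ci(scores))
```

Running both functions on the same scores for several test set sizes gives a direct view of how closely the two intervals agree, even when the per-image metric distribution is skewed.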