A methodology for clinically driven interactive segmentation evaluation

Parhom Esmaeili, Virginia Fernandez, Pedro Borges, Eli Gibson, Sebastien Ourselin, M. Jorge Cardoso

TL;DR

The paper tackles the lack of clinically realistic evaluation for interactive medical image segmentation and proposes a clinically grounded evaluation methodology plus a modular software framework to standardise prompts, tasks, and metrics. It benchmarks several interactive models (e.g., SAM2, SAM-Med2D, SAM-Med3D, SegVol) across diverse tasks with varying voxel counts, anisotropy, and target geometries, using metrics such as Dice, NSD, and interaction-normalised nAUC. Key findings show that minimising information loss when processing user prompts and employing adaptive zooming improve robustness; that 2D methods perform well on slab-like images and coarse targets, while 3D context benefits large or irregularly shaped targets; and that non-medical-domain models (e.g. SAM2) can struggle with low-contrast, complex shapes. The framework enables fair, deployment-relevant benchmarking, informs future user studies on prompting effort, and highlights directions for expanding algorithm coverage and ensuring proper data-use separation.

Abstract

Interactive segmentation is a promising strategy for building robust, generalisable algorithms for volumetric medical image segmentation. However, inconsistent and clinically unrealistic evaluation hinders fair comparison and misrepresents real-world performance. We propose a clinically grounded methodology for defining evaluation tasks and metrics, and build a software framework for constructing standardised evaluation pipelines. We evaluate state-of-the-art algorithms across heterogeneous and complex tasks and observe that (i) minimising information loss when processing user interactions is critical for model robustness, (ii) adaptive-zooming mechanisms boost robustness and speed convergence, (iii) performance drops if validation prompting behaviour/budgets differ from training, (iv) 2D methods perform well with slab-like images and coarse targets, but 3D context helps with large or irregularly shaped targets, and (v) performance of non-medical-domain models (e.g. SAM2) degrades with poor contrast and complex shapes.

Paper Structure

This paper contains 9 sections, 5 figures.

Figures (5)

  • Figure 1: A flowchart for identifying compatible experiments that are passed to the proposed evaluation framework for interaction simulation and evaluation.
  • Figure 2: Algorithm fingerprints. Ticks: full support; Tildes: partial support; Crosses: no support. Implicit editing uses full prompt memory, explicit only uses current prompts, and atomic editing re-does inference with full prompt memory. All except SAM-Med3D support scribbles (represented as a set of points); SAM-Med3D is limited by a one-point-per-class constraint. Only SAM2 natively supports multi-channel radiological images.
  • Figure 3: Top: Median Dice and NSD across the refinement simulations as image voxel count increases (hippocampus, brain tumour core, pancreas). Bottom: Summary metrics for experiment 1; all metrics but NoF report dataset medians, NoF reports percentages. Bold indicates best metric per task.
  • Figure 4: Top: Median Dice and NSD across refinement simulations for the whole prostate and brain tumour core tasks. Axes of complexity: Highly anisotropic versus isotropic images, and spherical targets versus irregularly shaped targets. Bottom: Summary metrics for experiments 2 & 3; all metrics but NoF report dataset medians, NoF reports percentages. Bold indicates best metric per task.
  • Figure 5: Top: Median Dice and NSD across refinement simulations for variation in target size relative to the image volume (lung lesion versus whole pancreas). Bottom: Summary metrics for experiment 4; all metrics but NoF report dataset medians, NoF reports percentages. Bold indicates best metric per task.
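
The curves and summary metrics in Figures 3 to 5 track segmentation quality across successive refinement interactions. As a minimal sketch, assuming binary masks and one score per interaction step, the snippet below computes Dice and an interaction-normalised area under the score curve. The function names (`dice`, `normalised_auc`), the unit-spacing trapezoidal rule, and the division by the number of intervals are illustrative assumptions, not the paper's definition of nAUC. NSD is omitted because it depends on a task-specific distance tolerance.

```python
import numpy as np


def dice(pred, ref):
    """Dice overlap between two binary masks of identical shape."""
    pred = np.asarray(pred, dtype=bool)
    ref = np.asarray(ref, dtype=bool)
    denom = pred.sum() + ref.sum()
    if denom == 0:
        # Assumed convention: two empty masks count as perfect agreement.
        return 1.0
    return 2.0 * np.logical_and(pred, ref).sum() / denom


def normalised_auc(scores):
    """Trapezoidal area under a per-interaction score curve, divided by the
    number of intervals so that a constant score of 1.0 yields 1.0.
    (Hypothetical normalisation chosen for illustration.)"""
    scores = np.asarray(scores, dtype=float)
    if scores.size < 2:
        raise ValueError("need scores for at least two interaction steps")
    area = np.sum((scores[1:] + scores[:-1]) / 2.0)
    return area / (scores.size - 1)


# Toy example: a 4x4x4 reference cube and a prediction shifted by one voxel.
ref = np.zeros((8, 8, 8), dtype=bool)
ref[2:6, 2:6, 2:6] = True
pred = np.zeros_like(ref)
pred[2:6, 2:6, 3:7] = True

print(dice(pred, ref))                    # 0.75
print(normalised_auc([0.5, 0.75, 1.0]))   # 0.75
```

Under this normalisation, an algorithm that reaches a high score after few interactions scores higher than one that reaches the same final score late, which is the behaviour an interaction-normalised summary is meant to reward. Dataset-level summaries such as the medians reported in the figures could then be taken with `np.median` over per-case values.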