RadioActive: 3D Radiological Interactive Segmentation Benchmark

Constantin Ulrich, Tassilo Wald, Emily Tempus, Maximilian Rokuss, Paul F. Jaeger, Klaus Maier-Hein

TL;DR

RadioActive tackles the gap in 3D radiological interactive segmentation by providing an open, extensible benchmark that evaluates both 2D and 3D prompting strategies across ten diverse datasets using a standardized evaluation protocol. The framework introduces realistic prompting schemes (interpolation and propagation) and scribble-based refinement, which reduce human effort and enable iterative refinement. Across seven models, it reveals that SAM2 can outperform specialized medical models under realistic prompting, and that simple interpolation strategies can match slice-by-slice prompting, with iterative refinement further boosting accuracy. Bounding-box prompts generally outperform point prompts, and 2D models can rival 3D models when paired with an effective prompting strategy; however, 3D models still struggle on some MRI tasks and large structures. By open-sourcing RadioActive, the authors provide a reproducible, community-driven platform to accelerate progress in interactive 3D medical image segmentation and its clinical adoption.

Abstract

Effortless and precise segmentation with minimal clinician effort could greatly streamline clinical workflows. Recent interactive segmentation models, inspired by Meta's Segment Anything, have made significant progress but face critical limitations in 3D radiology. These include impractical human interaction requirements such as slice-by-slice operations for 2D models on 3D data and a lack of iterative refinement. Prior studies have been hindered by inadequate evaluation protocols, resulting in unreliable performance assessments and inconsistent findings across studies. The RadioActive benchmark addresses these challenges by providing a rigorous and reproducible evaluation framework for interactive segmentation methods in clinically relevant scenarios. It features diverse datasets, a wide range of target structures, and the most impactful 2D and 3D interactive segmentation methods, all within a flexible and extensible codebase. We also introduce advanced prompting techniques that reduce interaction steps, enabling fair comparisons between 2D and 3D models. Surprisingly, SAM2 outperforms all specialized medical 2D and 3D models in a setting requiring only a few interactions to generate prompts for a 3D volume. This challenges prevailing assumptions and demonstrates that general-purpose models surpass specialized medical approaches. By open-sourcing RadioActive, we invite researchers to integrate their models and prompting techniques, ensuring continuous and transparent evaluation of 3D medical interactive models.

Paper Structure

This paper contains 40 sections, 9 figures, 9 tables.

Figures (9)

  • Figure 1: Current interactive segmentation methods require clinicians to interact with radiological images slice-by-slice, leading to increased workload.
  • Figure 2: RadioActive overview. Although our evaluation is performed on entire 3D volumes, the benchmark accommodates both 3D and 2D interactive segmentation methods. While 3D model prompting is relatively straightforward, we introduce prompting and refinement strategies for 2D models that minimize the effort required from human interaction. The benchmark is designed to be extensible, and researchers are encouraged to propose and integrate additional methods seamlessly using our codebase, particularly for areas marked by three dots.
  • Figure 3: Some models operate natively in 3D and enable full 3D interaction. Only models that accept mask prompts allow iterative refinement of initial predictions with human guidance.
  • Figure 4: Different prompting schemes for 2D models based on point prompts (on the left) and box prompts (on the right). While a) and b) expect unrealistic human slice-by-slice interaction, c) and d) illustrate the proposed prompt interpolation schemes, where a human needs to provide prompts for at least 3 slices (4 slices in this case). Prompts for the remaining slices are generated by interpolating between the initial prompts. e) and f) present the proposed prompt propagation methods, where the prompt for each subsequent slice is automatically generated based on the model's prediction from the previous slice. Only the initial slice and upper and lower boundaries require manual prompts.
  • Figure 5: Unrealistic prompting with a 2D box on every slice performs best. 2D models prompted with one Box Prompt Per Slice (BPS), indicated with a star, outperform models prompted with varying numbers of Point Prompts per Slice (PPS, line plots). Drawing a 2D box counts as 2 interactions, the same as providing 2 points. Providing alternating positive and negative points (dashed lines) is slightly superior to providing only positive points.
  • ...and 4 more figures
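
The prompt interpolation and prompt propagation schemes described in the Figure 4 caption can be illustrated with a minimal Python sketch. This is not the benchmark's code: the function names, the (x_min, y_min, x_max, y_max) box layout, the use of linear interpolation, and the `predict_slice` callable (a stand-in for any 2D model that maps an image slice and a box prompt to a binary mask) are all assumptions made for illustration. The caption states that a human provides prompts for at least 3 slices; the sketch accepts two or more for simplicity.

```python
import numpy as np


def interpolate_boxes(annotated):
    """Linearly interpolate 2D box prompts between annotated slices.

    annotated: dict mapping slice index -> (x_min, y_min, x_max, y_max).
    Returns a dict with one box for every slice from the lowest to the
    highest annotated slice (inclusive). Hypothetical helper.
    """
    slices = sorted(annotated)
    if len(slices) < 2:
        raise ValueError("need at least two annotated slices")
    coords = np.array([annotated[s] for s in slices], dtype=float)
    all_slices = np.arange(slices[0], slices[-1] + 1)
    # Interpolate each of the four box coordinates independently along z.
    cols = [np.interp(all_slices, slices, coords[:, k]) for k in range(4)]
    boxes = np.stack(cols, axis=1)
    return {int(z): tuple(float(v) for v in b) for z, b in zip(all_slices, boxes)}


def box_from_mask(mask):
    """Tight box around a binary 2D mask, or None if the mask is empty."""
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return None
    return (int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max()))


def propagate_boxes(volume, start, first_box, z_min, z_max, predict_slice):
    """Propagate a box prompt slice by slice from a manually prompted slice.

    volume: array indexed as volume[z] -> 2D slice.
    start, first_box: the manually prompted slice and its box.
    z_min, z_max: manually provided lower and upper boundaries (inclusive).
    predict_slice: assumed callable (slice, box) -> binary 2D mask.
    """
    masks = {start: predict_slice(volume[start], first_box)}
    for step in (1, -1):  # walk upward, then downward, from the start slice
        box = box_from_mask(masks[start])
        z = start + step
        while z_min <= z <= z_max and box is not None:
            masks[z] = predict_slice(volume[z], box)
            # The next prompt is derived from the current prediction.
            box = box_from_mask(masks[z])
            z += step
    return masks
```

In this sketch, propagation stops early in a direction as soon as a prediction comes back empty, since no box can be derived from an empty mask; a real implementation might handle that case differently.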