Comparing SAM 2 and SAM 3 for Zero-Shot Segmentation of 3D Medical Data

Satrajit Chakrabarty, Ravi Soni

TL;DR

The study is a large-scale, controlled comparison of SAM 2 and SAM 3 for zero-shot segmentation of 3D medical data using purely visual prompts. It standardizes prompting, propagation, and evaluation across 16 datasets spanning CT, MRI, ultrasound, and endoscopy. SAM 3 offers substantially better prompt initialization and tracking for complex structures, while SAM 2 is more stable for compact, rigid organs under strong spatial guidance. The results position SAM 3 as the superior general-purpose default for medical segmentation, with caveats about propagation failures in certain modalities and anatomies. The findings inform practical model selection for clinical and research workflows, and they establish a baseline for future exploration of concept-based, vision-language prompting in medical imaging.

Abstract

Foundation models for promptable segmentation, including SAM, SAM 2, and the recently released SAM 3, have renewed interest in zero-shot segmentation of medical imaging. Although these models perform strongly on natural images, their behavior on medical data remains insufficiently characterized. While SAM 2 is widely used for annotation in 3D medical workflows, SAM 3 introduces a new perception backbone, detector-tracker pipeline, and concept-level prompting that may alter its behavior under spatial prompts. We present the first controlled comparison of SAM 2 and SAM 3 for zero-shot segmentation of 3D medical volumes and videos under purely visual prompting, with concept mechanisms disabled. We assess whether SAM 3 can serve as an out-of-the-box replacement for SAM 2 without customization. We benchmark both models on 16 public datasets (CT, MRI, 3D and cine ultrasound, endoscopy) covering 54 anatomical structures, pathologies, and surgical instruments. Prompts are restricted to the first frame and use four modes: single-click, multi-click, bounding box, and dense mask. This design standardizes preprocessing, prompt placement, propagation rules, and metric computation to disentangle prompt interpretation from propagation. Prompt-frame analysis shows that SAM 3 provides substantially stronger initialization than SAM 2 for click prompting across most structures. In full-volume analysis, SAM 3 retains this advantage for complex, vascular, and soft-tissue anatomies, emerging as the more versatile general-purpose segmenter. While SAM 2 remains competitive for compact, rigid organs under strong spatial guidance, it frequently fails on challenging targets where SAM 3 succeeds. Overall, our results suggest that SAM 3 is the superior default choice for most medical segmentation tasks, particularly those involving sparse user interaction or complex anatomical topology.
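To make the first-frame prompting modes concrete, the following is a minimal sketch of how a single click, a bounding box, and a dense mask prompt could be derived from a ground-truth mask on the prompted frame. The abstract does not spell out the prompt-placement rules, so the choices here are assumptions for illustration only: the click is the foreground pixel nearest the mask centroid, and the box is the tight bounding box. The function name `prompts_from_mask` is hypothetical and is not part of SAM 2 or SAM 3.

```python
import numpy as np


def prompts_from_mask(mask):
    """Derive illustrative visual prompts from a 2D binary ground-truth mask.

    Assumed placement rules (not taken from the paper):
      - click: foreground pixel closest to the mask centroid, as (x, y)
      - box:   tight bounding box, as (x_min, y_min, x_max, y_max)
      - mask:  the binary mask itself
    """
    mask = np.asarray(mask, dtype=bool)
    if mask.ndim != 2:
        raise ValueError("expected a 2D mask for the prompted frame")

    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        raise ValueError("mask is empty; no prompt can be placed")

    # Tight bounding box around all foreground pixels (inclusive coordinates).
    box = (int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max()))

    # The centroid can fall outside a non-convex structure, so snap it to
    # the nearest pixel that actually belongs to the foreground.
    cy, cx = ys.mean(), xs.mean()
    nearest = int(np.argmin((ys - cy) ** 2 + (xs - cx) ** 2))
    click = (int(xs[nearest]), int(ys[nearest]))

    return {"click": click, "box": box, "mask": mask}


# Small synthetic example: a 3x5 rectangle inside a 10x10 frame.
frame_mask = np.zeros((10, 10), dtype=bool)
frame_mask[2:5, 3:8] = True
example_prompts = prompts_from_mask(frame_mask)
```

Snapping the click to a foreground pixel matters for vascular or ring-shaped targets, where the raw centroid may lie on background. Multi-click prompting is omitted because its placement rule is not described in this section.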

Paper Structure

This paper contains 13 sections, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Examples of images overlaid with ground-truth annotation masks from the sixteen publicly available medical imaging datasets used for evaluation in this study, spanning 3D CT, 3D MRI, ultrasound (2D cine and 3D volumes), and endoscopy. The panel layout is designed to capture variability in anatomical regions, pathologies, imaging contrast, and acquisition settings (refer to the dataset description table for dataset details). The human anatomy illustration is adapted from a graphic obtained from https://www.vecteezy.com.
  • Figure 2: Boxplots showing zero-shot segmentation performance for all anatomical structures in each CT dataset. For every structure, eight boxplots appear in a fixed left-to-right order corresponding to the four prompting modes of SAM 2 followed by the four prompting modes of SAM 3: SAM 2 (1,0), SAM 2 (1,2), SAM 2 (BBox), SAM 2 (Mask), SAM 3 (1,0), SAM 3 (1,2), SAM 3 (BBox), SAM 3 (Mask).
  • Figure 3: Boxplots showing zero-shot segmentation performance for all anatomical structures in each MR, US, and endoscopy dataset. For every structure, eight boxplots appear in a fixed left-to-right order corresponding to the four prompting modes of SAM 2 followed by the four prompting modes of SAM 3: SAM 2 (1,0), SAM 2 (1,2), SAM 2 (BBox), SAM 2 (Mask), SAM 3 (1,0), SAM 3 (1,2), SAM 3 (BBox), SAM 3 (Mask).
  • Figure 4: Qualitative examples illustrating cases where SAM 3 outperforms SAM 2. All three examples use single-click (1,0) prompting and show SAM 3's stronger prompt initialization: it localizes the structure better even under sparse prompts. SAM 3 produces accurate, spatially coherent segmentations even for small or low-contrast structures, whereas SAM 2 fails to localize the target on the prompted frame, resulting in over-segmentation and notably lower DSC. [Colors: ground truth in green, SAM 2 predictions in blue, and SAM 3 predictions in red.]
  • Figure 5: Qualitative examples illustrating cases where SAM 2 outperforms SAM 3. Examples 1 and 2 use bounding-box prompts and Example 3 uses a mask prompt. In these examples, SAM 3 provides strong initial localization but exhibits propagation failures, including hallucinated residual masks in later slices (Examples 1 and 2) and erosion or collapse of structure boundaries under low contrast or motion (Example 3). In contrast, SAM 2 maintains more stable slice-to-slice consistency and suppresses spurious predictions, yielding higher DSC. [Colors: ground truth in green, SAM 2 predictions in blue, and SAM 3 predictions in red.]
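
The captions above report performance as DSC (Dice similarity coefficient). As a reference point, here is a minimal sketch of the standard Dice computation on binary volumes using NumPy. It is not the paper's evaluation code. The helper name `dice_coefficient` and the handling of the empty-versus-empty case (returning 1.0 by default) are assumptions made for this illustration.

```python
import numpy as np


def dice_coefficient(pred, target, empty_value=1.0):
    """Dice similarity coefficient between two binary masks of equal shape.

    DSC = 2 * |P ∩ T| / (|P| + |T|)

    `empty_value` is returned when both masks are empty; that convention is
    an assumption here, and evaluation protocols differ on it.
    """
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    if pred.shape != target.shape:
        raise ValueError("prediction and target must have the same shape")

    denom = int(pred.sum()) + int(target.sum())
    if denom == 0:
        return float(empty_value)

    intersection = int(np.logical_and(pred, target).sum())
    return 2.0 * intersection / denom


# Synthetic 3D example: two 2x4x4 blocks (32 voxels each) overlapping in 16 voxels.
gt_volume = np.zeros((4, 8, 8), dtype=bool)
gt_volume[1:3, 2:6, 2:6] = True
pred_volume = np.zeros((4, 8, 8), dtype=bool)
pred_volume[1:3, 2:6, 4:8] = True
example_dsc = dice_coefficient(pred_volume, gt_volume)  # 2 * 16 / 64 = 0.5
```

Because the denominator sums both mask sizes, a hallucinated residual mask in later slices (as in Figure 5) lowers DSC even when the overlap on the prompted frame is perfect.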