Comparing SAM 2 and SAM 3 for Zero-Shot Segmentation of 3D Medical Data

Satrajit Chakrabarty; Ravi Soni

TL;DR

This study presents a large-scale, controlled comparison of SAM 2 and SAM 3 for zero-shot segmentation of 3D medical data using purely visual prompts. It standardizes prompting, propagation, and evaluation across 16 datasets spanning CT, MRI, ultrasound, and endoscopy. SAM 3 offers substantially better prompt initialization and tracking for complex structures, while SAM 2 provides greater stability for compact, rigid organs under strong spatial guidance. The results position SAM 3 as the superior general-purpose default for medical segmentation, with caveats about propagation failures in certain modalities and anatomies, and they motivate future work on concept-based prompts. These findings inform practical model selection for clinical and research workflows and establish a baseline for future exploration of vision-language prompting in medical imaging.

Abstract

Foundation models for promptable segmentation, including SAM, SAM 2, and the recently released SAM 3, have renewed interest in zero-shot segmentation of medical images. Although these models perform strongly on natural images, their behavior on medical data remains insufficiently characterized. While SAM 2 is widely used for annotation in 3D medical workflows, SAM 3 introduces a new perception backbone, detector-tracker pipeline, and concept-level prompting that may alter its behavior under spatial prompts. We present the first controlled comparison of SAM 2 and SAM 3 for zero-shot segmentation of 3D medical volumes and videos under purely visual prompting, with concept mechanisms disabled. We assess whether SAM 3 can serve as an out-of-the-box replacement for SAM 2 without customization. We benchmark both models on 16 public datasets (CT, MRI, 3D and cine ultrasound, endoscopy) covering 54 anatomical structures, pathologies, and surgical instruments. Prompts are restricted to the first frame and use four modes: single-click, multi-click, bounding box, and dense mask. This design standardizes preprocessing, prompt placement, propagation rules, and metric computation to disentangle prompt interpretation from propagation. Prompt-frame analysis shows that SAM 3 provides substantially stronger initialization than SAM 2 for click prompting across most structures. In full-volume analysis, SAM 3 retains this advantage for complex, vascular, and soft-tissue anatomies, emerging as the more versatile general-purpose segmenter. While SAM 2 remains competitive for compact, rigid organs under strong spatial guidance, it frequently fails on challenging targets where SAM 3 succeeds. Overall, our results suggest that SAM 3 is the superior default choice for most medical segmentation tasks, particularly those involving sparse user interaction or complex anatomical topology.
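
To make the four first-frame prompt modes concrete, the following is a minimal, dependency-free sketch of how such prompts could be derived from a ground-truth mask of the prompt frame. It is not the authors' implementation. The abstract does not state how clicks are placed or which overlap metric is used, so the centroid-nearest click, the farthest-point sampling for extra clicks, the tight bounding box, and the Dice score are all assumptions chosen for illustration, and every function name here is hypothetical.

```python
from typing import Dict, List, Tuple

Mask = List[List[int]]   # 2D binary mask: 1 = foreground, 0 = background
Point = Tuple[int, int]  # (row, col)


def foreground_pixels(mask: Mask) -> List[Point]:
    """Return (row, col) coordinates of all foreground pixels, row-major."""
    return [(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v]


def sq_dist(a: Point, b: Point) -> int:
    """Squared Euclidean distance between two pixel coordinates."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2


def single_click(mask: Mask) -> Point:
    """Assumed rule: the foreground pixel closest to the mask centroid."""
    pts = foreground_pixels(mask)
    cr = sum(r for r, _ in pts) / len(pts)
    cc = sum(c for _, c in pts) / len(pts)
    return min(pts, key=lambda p: (p[0] - cr) ** 2 + (p[1] - cc) ** 2)


def multi_click(mask: Mask, n: int = 3) -> List[Point]:
    """Assumed rule: centroid click plus farthest-point foreground samples."""
    pts = foreground_pixels(mask)
    clicks = [single_click(mask)]
    while len(clicks) < min(n, len(pts)):
        # Pick the foreground pixel farthest from every click chosen so far.
        clicks.append(max(pts, key=lambda p: min(sq_dist(p, q) for q in clicks)))
    return clicks


def bounding_box(mask: Mask) -> Tuple[int, int, int, int]:
    """Tight box as (row_min, col_min, row_max, col_max), inclusive."""
    pts = foreground_pixels(mask)
    rows = [r for r, _ in pts]
    cols = [c for _, c in pts]
    return min(rows), min(cols), max(rows), max(cols)


def first_frame_prompts(mask: Mask, n_clicks: int = 3) -> Dict[str, object]:
    """Build all four prompt modes from the first-frame ground-truth mask."""
    return {
        "single_click": single_click(mask),
        "multi_click": multi_click(mask, n_clicks),
        "bounding_box": bounding_box(mask),
        "dense_mask": [row[:] for row in mask],  # copy of the mask itself
    }


def dice(pred: Mask, truth: Mask) -> float:
    """Dice overlap between two binary masks (1.0 when both are empty)."""
    inter = sum(1 for pr, tr in zip(pred, truth) for p, t in zip(pr, tr) if p and t)
    total = sum(map(sum, pred)) + sum(map(sum, truth))
    return 1.0 if total == 0 else 2.0 * inter / total
```

In a pipeline of this shape, `first_frame_prompts` would be called once on the prompt frame, the chosen prompt would be passed to the model under test, and `dice` would be computed separately on the prompt frame and on the propagated frames, which is one way to separate prompt interpretation from propagation quality.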
