More than Segmentation: Benchmarking SAM 3 for Segmentation, 3D Perception, and Reconstruction in Robotic Surgery
Wenzhen Dong, Jieming Yu, Yiming Huang, Hongqiu Wang, Lei Zhu, Albert C. S. Chung, Hongliang Ren, Long Bai
TL;DR
This paper benchmarks SAM 3 and SAM 3D in robot-assisted surgery across 2D segmentation, depth reconstruction, and 3D instrument segmentation. It shows that SAM 3 achieves state-of-the-art zero-shot segmentation with spatial prompts on EndoVis17/18, while language prompts underperform due to domain gaps. SAM 3D demonstrates strong monocular depth priors and plausible 3D instrument reconstructions in static or controlled scenes, but struggles in highly dynamic, narrow-baseline settings and incurs substantial computational cost. The findings indicate clear potential for data-efficient surgical analysis using SAM 3, while highlighting the need for domain-specific fine-tuning and optimized 3D perception for robust operating-room applications.
Abstract
The recently released SAM 3 and SAM 3D introduce significant advancements over their predecessor, SAM 2, particularly the integration of language-based segmentation and enhanced 3D perception capabilities. SAM 3 supports zero-shot segmentation across a wide range of prompts, including point, bounding box, and language-based prompts, allowing for more flexible and intuitive interaction with the model. In this empirical evaluation, we assess the performance of SAM 3 in robot-assisted surgery, benchmarking its zero-shot segmentation with point and bounding box prompts and exploring its effectiveness in dynamic video tracking, alongside its newly introduced language prompt segmentation. While language prompts show potential, their performance in the surgical domain is currently suboptimal, highlighting the need for further domain-specific training. Additionally, we investigate SAM 3D's depth reconstruction abilities, demonstrating its capacity to process surgical scene data and reconstruct 3D anatomical structures from 2D images. In comprehensive testing on the MICCAI EndoVis 2017 and EndoVis 2018 benchmarks, SAM 3 shows clear improvements over SAM and SAM 2 in both image and video segmentation under spatial prompts. Zero-shot evaluations of SAM 3D on SCARED, StereoMIS, and EndoNeRF indicate strong monocular depth estimation and realistic 3D instrument reconstruction, yet also reveal remaining limitations in complex, highly dynamic surgical scenes.
