SAM 2 in Robotic Surgery: An Empirical Evaluation for Robustness and Generalization in Surgical Video Segmentation

Jieming Yu, An Wang, Wenzhen Dong, Mengya Xu, Mobarakol Islam, Jie Wang, Long Bai, Hongliang Ren

TL;DR

The paper investigates the zero-shot segmentation capabilities of SAM 2 in robot-assisted surgery, focusing on surgical instrument segmentation under two prompting schemes for still images (bounding box and 1-point) and a 1-point prompt on the initial frame for video. On the EndoVis 2017 and 2018 benchmarks, bounding box prompts outperform state-of-the-art methods, while 1-point prompts improve substantially over the original SAM and approach or surpass unprompted state-of-the-art methods; video segmentation from a single first-frame point surpasses prior image-based approaches. SAM 2 also offers improved inference speed. The study further assesses robustness to real-world corruptions, showing that SAM 2 generally degrades less than the original SAM, though video segmentation from a first-frame prompt remains more vulnerable under severe perturbations. Overall, SAM 2 combines strong robustness with fast inference for downstream surgical tasks that allow only minimal prompting; segmentation quality in specific edges or regions and prompt-free methods remain open directions for future work.

Abstract

The recent Segment Anything Model (SAM) 2 has demonstrated remarkable foundational competence in semantic segmentation, with its memory mechanism and mask decoder further addressing challenges in video tracking and object occlusion, thereby achieving superior results in interactive segmentation for both images and videos. Building upon our previous empirical studies, we further explore the prompt-based zero-shot segmentation performance of SAM 2 in robot-assisted surgery, alongside its robustness against real-world corruptions. For static images, we employ two forms of prompts, 1-point and bounding box, while for video sequences the 1-point prompt is applied to the initial frame. Through extensive experimentation on the MICCAI EndoVis 2017 and EndoVis 2018 benchmarks, SAM 2, when utilizing bounding box prompts, outperforms state-of-the-art (SOTA) methods in comparative evaluations. The results with point prompts also exhibit a substantial enhancement over SAM's capabilities, nearing or even surpassing existing unprompted SOTA methodologies. In addition, SAM 2 demonstrates improved inference speed and less performance degradation under various image corruptions. Although slightly unsatisfactory results remain in specific edges or regions, SAM 2's robust adaptability to 1-point prompts underscores its potential for downstream surgical tasks with limited prompt requirements.
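
To make the two image prompting schemes concrete, the following is a minimal sketch of how a bounding box prompt and a 1-point prompt could be derived from a binary instrument mask. This section does not describe how the authors generate their prompts, so the procedure here is an assumption for illustration only: the function name `prompts_from_mask` is hypothetical, the box is taken as the tight extent of the mask, and the point is the foreground pixel nearest the mask centroid.

```python
import numpy as np


def prompts_from_mask(mask: np.ndarray):
    """Derive illustrative box and 1-point prompts from a binary mask.

    Assumption: this is NOT the paper's procedure, only one plausible way
    to turn a ground-truth mask into prompts.

    Returns (box, point), where box is (x_min, y_min, x_max, y_max) and
    point is (x, y) in pixel coordinates, or (None, None) for an empty mask.
    """
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return None, None

    # Bounding box prompt: tight extent of the foreground pixels.
    box = (int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max()))

    # 1-point prompt: the foreground pixel closest to the centroid, so the
    # point lies on the object even when the shape is concave or thin.
    cy, cx = ys.mean(), xs.mean()
    nearest = int(np.argmin((ys - cy) ** 2 + (xs - cx) ** 2))
    point = (int(xs[nearest]), int(ys[nearest]))
    return box, point


if __name__ == "__main__":
    demo = np.zeros((6, 8), dtype=np.uint8)
    demo[1:4, 2:6] = 1  # a small rectangular "instrument"
    print(prompts_from_mask(demo))
```

Snapping the point to the nearest foreground pixel rather than using the raw centroid matters for surgical tools, whose elongated or forked shapes can place the centroid on background tissue.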

Paper Structure

This paper contains 10 sections, 2 figures, and 3 tables.

Figures (2)

  • Figure 1: Qualitative results of SAM 2 on three images of the surgical scene.
  • Figure 2: Qualitative results of SAM 2 under 18 data corruptions of level-5 severity. Given that the implementation of specific transformations (e.g., spatter) relies on random functions, and the corrupted dataset in our previous version is no longer accessible, we have regenerated the corrupted images. While some types of images may exhibit slight variations, the overall statistical consistency ensures the reliability of our findings.
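
The Figure 2 caption notes that some corruptions rely on random functions, which is why regenerated images can differ slightly from an earlier corrupted set. As a hedged illustration, the sketch below applies a single corruption (additive Gaussian noise) at a chosen severity with a fixed seed, so the corrupted image can be regenerated exactly. The function name `gaussian_noise`, the `SIGMAS` values, and the seeding scheme are assumptions for illustration and are not taken from the paper, which evaluates 18 corruption types.

```python
import numpy as np

# Hypothetical noise standard deviations for severity levels 1..5
# (on a 0..1 intensity scale). These values are NOT from the paper.
SIGMAS = (0.04, 0.08, 0.12, 0.18, 0.26)


def gaussian_noise(image: np.ndarray, severity: int = 5, seed: int = 0) -> np.ndarray:
    """Add seeded Gaussian noise to a uint8 image at the given severity."""
    if not 1 <= severity <= len(SIGMAS):
        raise ValueError(f"severity must be in 1..{len(SIGMAS)}")

    # A dedicated generator with a fixed seed makes the output reproducible.
    rng = np.random.default_rng(seed)

    scaled = image.astype(np.float32) / 255.0
    noisy = scaled + rng.normal(0.0, SIGMAS[severity - 1], size=scaled.shape)
    return (np.clip(noisy, 0.0, 1.0) * 255.0).round().astype(np.uint8)


if __name__ == "__main__":
    frame = np.full((32, 32, 3), 128, dtype=np.uint8)  # stand-in for a video frame
    a = gaussian_noise(frame, severity=5, seed=42)
    b = gaussian_noise(frame, severity=5, seed=42)
    print(np.array_equal(a, b))  # same seed, same corrupted image
```

Storing the seed alongside each corrupted image is one way to keep a corrupted benchmark recoverable even if the generated files themselves are lost.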