Systematic Evaluation and Guidelines for Segment Anything Model in Surgical Video Analysis

Cheng Yuan, Jian Jiang, Kunyi Yang, Lv Wu, Rui Wang, Zi Meng, Haonan Ping, Ziyu Xu, Yifan Zhou, Wanli Song, Hesheng Wang, Yueming Jin, Qi Dou, Yutong Ban

TL;DR

This work systematically evaluates the zero-shot segmentation capabilities of SAM2 on 9 surgical datasets spanning multiple modalities, analyzing prompting strategies, reinitialization, auto-segmentation, and sparse finetuning to produce practical deployment guidelines. It demonstrates strong zero-shot performance in structured instrument, scene, and multi-organ segmentation, but reveals limitations in dynamic surgical environments and in temporal coherence. These limitations can be mitigated by mask-based prompts and periodic reinitialization, and MD-only finetuning often provides the best efficiency-accuracy balance. The study contributes actionable insights for data-efficient surgical video analysis, offering a unified framework for multi-object segmentation and a roadmap for domain-adaptive finetuning of foundation models in surgery.
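
As a rough illustration of periodic reinitialization, the sketch below splits a video into fixed-length chunks, where the first frame of each chunk would receive a fresh prompt before tracking continues. The function name `reinit_schedule` and the 30-frame interval are hypothetical choices for illustration; the summary above does not state an interval or an implementation.

```python
from typing import List, Tuple


def reinit_schedule(num_frames: int, interval: int) -> List[Tuple[int, int]]:
    """Return (start, end) frame ranges, end exclusive.

    The first frame of every range is where a new prompt would be supplied;
    the remaining frames in the range are tracked from that prompt.
    `interval` is an assumed, user-chosen value, not one taken from the paper.
    """
    if num_frames < 0 or interval <= 0:
        raise ValueError("num_frames must be >= 0 and interval must be > 0")
    return [
        (start, min(start + interval, num_frames))
        for start in range(0, num_frames, interval)
    ]


# A 70-frame clip re-prompted every 30 frames (hypothetical interval).
print(reinit_schedule(70, 30))  # [(0, 30), (30, 60), (60, 70)]
```

A shorter interval means more prompts to supply, so in practice the choice trades annotation effort against accumulated tracking drift.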

Abstract

Surgical video segmentation is critical for AI to interpret spatial-temporal dynamics in surgery, yet model performance is constrained by limited annotated data. The SAM2 model, pretrained on natural videos, offers potential for zero-shot surgical segmentation, but its applicability in complex surgical environments, with challenges like tissue deformation and instrument variability, remains unexplored. We present the first comprehensive evaluation of the zero-shot capability of SAM2 on 9 surgical datasets (17 surgery types), covering laparoscopic, endoscopic, and robotic procedures. We analyze various prompting (points, boxes, mask) and finetuning (dense, sparse) strategies, robustness to surgical challenges, and generalization across procedures and anatomies. Key findings reveal that while SAM2 demonstrates notable zero-shot adaptability in structured scenarios (e.g., instrument segmentation, multi-organ segmentation, and scene segmentation), its performance varies under dynamic surgical conditions, exposing gaps in handling temporal coherence and domain-specific artifacts. These results highlight future pathways to adaptive data-efficient solutions for the surgical data science field.
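
To make the three prompt types concrete, the following minimal sketch derives point and box prompts from a binary ground-truth mask; the mask itself would serve as the mask prompt. It uses plain Python lists, and the helper names (`foreground_pixels`, `sample_point_prompts`, `box_prompt`) and the seeded uniform sampling are assumptions for illustration, not the authors' protocol or the SAM2 API.

```python
import random
from typing import List, Optional, Tuple

Mask = List[List[int]]  # rows of 0/1 values; 1 marks the target


def foreground_pixels(mask: Mask) -> List[Tuple[int, int]]:
    """Collect (x, y) coordinates of all foreground pixels."""
    return [(x, y) for y, row in enumerate(mask) for x, v in enumerate(row) if v]


def sample_point_prompts(mask: Mask, k: int, seed: int = 0) -> List[Tuple[int, int]]:
    """Pick up to k distinct random foreground pixels as point prompts."""
    pixels = foreground_pixels(mask)
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    return rng.sample(pixels, min(k, len(pixels)))


def box_prompt(mask: Mask) -> Optional[Tuple[int, int, int, int]]:
    """Tight bounding box (x_min, y_min, x_max, y_max), or None if the mask is empty."""
    pixels = foreground_pixels(mask)
    if not pixels:
        return None
    xs = [x for x, _ in pixels]
    ys = [y for _, y in pixels]
    return (min(xs), min(ys), max(xs), max(ys))


mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
print(box_prompt(mask))               # (1, 1, 2, 2)
print(sample_point_prompts(mask, 1))  # one random foreground point
```

The three prompt types differ in how much of the target they specify: a single point gives one location, a box gives the extent, and a mask gives the full shape.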

Paper Structure (20 sections, 4 figures, 2 tables)

Figures (4)

  • Figure 1: SAM2 overview and evaluated datasets. a) SAM2 application across different surgical specialties; b) Quantitative segmentation performance across SAM2 prompt types, including point, box and mask prompts; c) Datasets used for SAM2 evaluation, including 8 public datasets and 1 private dataset.
  • Figure 2: Instrument segmentation performance of SAM2 across surgical datasets. a) Qualitative comparison showing segmentation results on the EndoVis2017 dataset across multiple frames, demonstrating temporal consistency of instrument tracking. Different colors represent distinct surgical instruments identified in each frame; b) Radar chart displaying class-wise Intersection over Union (IoU) performance for various surgical instrument types on the EndoVis2018 dataset. Each axis represents a different instrument class with values ranging from 0 to 1, where higher values indicate better segmentation accuracy; c) Quantitative evaluation on the EndoNeRF Cutting dataset showing mean IoU (mIoU) and mean Dice coefficient (mDice) metrics. Bar heights represent segmentation accuracy, with error bars indicating standard deviation across test samples.
  • Figure 3: Qualitative segmentation results for multi-organ surgical scenes on the DSAD dataset. Each row shows frames from a surgical video sequence, starting with the first row that contains the prompt frame (Frame 00), where colored dots indicate point prompts, colored rectangles indicate box prompts, and colored irregular shapes indicate mask prompts. The following rows show subsequent frames (Frames 04, 09, 19, 39, and 69) with results predicted by SAM2 using the corresponding prompts. Columns represent: 1) Current Frame: the original frame; 2) Ground Truth: manual multi-organ annotations; 3) 1 Random Point: segmentation with a single random point prompt per target; 4) 5 Random Points: segmentation with five random point prompts per target; 5) Bounding Box: segmentation guided by a bounding box for each target; and 6) Mask: segmentation initialized from a predefined mask for each target. The comparison demonstrates the impact of different prompting strategies on multi-organ segmentation quality and temporal consistency across frames.
  • Figure 4: Qualitative comparison showing segmentation results on the CholecSeg8k and Endoscapes2023 datasets. Mask prompting achieves the best segmentation performance on both datasets. a) On the CholecSeg8k dataset, different colors represent multiple instruments, organs, and tissues. Point prompting fails to recognize the complex background containing various tissues; b) On the Endoscapes2023 dataset, tubular anatomical structures such as the cystic artery are recognized progressively better as the prompting scope expands.
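
The IoU and Dice metrics named in the Figure 2 caption can be computed per mask pair as in the sketch below, using the standard definitions IoU = |P ∩ T| / |P ∪ T| and Dice = 2|P ∩ T| / (|P| + |T|). The function name `iou_and_dice` and the convention of scoring two empty masks as 1.0 are assumptions for illustration; the captions do not state how the authors handle empty masks or aggregate across frames.

```python
from typing import List, Tuple

Mask = List[List[int]]  # rows of 0/1 values


def iou_and_dice(pred: Mask, target: Mask) -> Tuple[float, float]:
    """Return (IoU, Dice) for two binary masks of identical shape."""
    inter = pred_area = target_area = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            p, t = bool(p), bool(t)
            inter += int(p and t)
            pred_area += int(p)
            target_area += int(t)
    union = pred_area + target_area - inter
    if union == 0:
        # Both masks empty: treated as a perfect match (assumed convention).
        return 1.0, 1.0
    return inter / union, 2 * inter / (pred_area + target_area)


pred = [[1, 1], [0, 0]]
target = [[1, 0], [1, 0]]
iou, dice = iou_and_dice(pred, target)
print(round(iou, 3), dice)  # 0.333 0.5
```

Averaging these per-pair scores over classes or frames would give mIoU and mDice; Dice is never lower than IoU for the same pair, so the two should not be compared directly.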