Systematic Evaluation and Guidelines for Segment Anything Model in Surgical Video Analysis
Cheng Yuan, Jian Jiang, Kunyi Yang, Lv Wu, Rui Wang, Zi Meng, Haonan Ping, Ziyu Xu, Yifan Zhou, Wanli Song, Hesheng Wang, Yueming Jin, Qi Dou, Yutong Ban
TL;DR
This work systematically evaluates the zero-shot segmentation capabilities of SAM2 on 9 surgical datasets spanning multiple modalities, analyzing prompting strategies, reinitialization, auto-segmentation, and sparse finetuning to produce practical deployment guidelines. SAM2 shows strong zero-shot performance in structured instrument, scene, and multi-organ segmentation, but it is limited in dynamic surgical environments and in temporal coherence. Mask-based prompts and periodic reinitialization can mitigate these limitations, and MD-only finetuning often provides the best efficiency-accuracy balance. The study contributes actionable insights for data-efficient surgical video analysis, offering a unified framework for multi-object segmentation and a roadmap for domain-adaptive finetuning of foundation models in surgery.
Abstract
Surgical video segmentation is critical for AI to interpret spatial-temporal dynamics in surgery, yet model performance is constrained by limited annotated data. The SAM2 model, pretrained on natural videos, offers potential for zero-shot surgical segmentation, but its applicability in complex surgical environments, with challenges like tissue deformation and instrument variability, remains unexplored. We present the first comprehensive evaluation of the zero-shot capability of SAM2 on 9 surgical datasets (17 surgery types), covering laparoscopic, endoscopic, and robotic procedures. We analyze various prompting (point, box, mask) and finetuning (dense, sparse) strategies, robustness to surgical challenges, and generalization across procedures and anatomies. Key findings reveal that while SAM2 demonstrates notable zero-shot adaptability in structured scenarios (e.g., instrument segmentation, multi-organ segmentation, and scene segmentation), its performance varies under dynamic surgical conditions, exposing gaps in handling temporal coherence and domain-specific artifacts. These results point to future pathways toward adaptive, data-efficient solutions for the surgical data science field.
