SAM2S: Segment Anything in Surgical Videos via Semantic Long-term Tracking
Haofeng Liu, Ziyue Wang, Sudhanshu Mishra, Mingqi Gao, Guanyi Qin, Chang Han Low, Alex Y. W. Kong, Yueming Jin
TL;DR
This work tackles surgical video segmentation by bridging the domain gap between natural and surgical scenes and enabling robust long-term tracking. It introduces SA-SV, the largest surgical iVOS benchmark with masklet annotations across eight procedure types, and SAM2S, a surgical specialization of SAM2 that adds DiveMem for long-term memory, Temporal Semantic Learning for instrument-aware semantics, and Ambiguity-Resilient Learning for multi-source label smoothing. The approach yields strong zero-shot generalization and real-time performance, achieving an average $\mathcal{J}$\&$\mathcal{F}$ of 80.42 on 3-click prompts at 68 FPS and outperforming both vanilla and fine-tuned SAM2 baselines. The results demonstrate that domain-specific data and memory-augmented, semantically informed learning significantly improve surgical video segmentation, with potential to enhance intraoperative guidance and assessment. The authors plan to release the dataset and code to enable broad adoption and further research.
Abstract
Surgical video segmentation is crucial for computer-assisted surgery, enabling precise localization and tracking of instruments and tissues. Interactive Video Object Segmentation (iVOS) models such as Segment Anything Model 2 (SAM2) provide prompt-based flexibility beyond methods with predefined categories, but face challenges in surgical scenarios due to the domain gap and limited long-term tracking. To address these limitations, we construct SA-SV, the largest surgical iVOS benchmark with instance-level spatio-temporal annotations (masklets) spanning eight procedure types (61k frames, 1.6k masklets), enabling comprehensive development and evaluation for long-term tracking and zero-shot generalization. Building on SA-SV, we propose SAM2S, a foundation model enhancing \textbf{SAM2} for \textbf{S}urgical iVOS through: (1) DiveMem, a trainable diverse memory mechanism for robust long-term tracking; (2) temporal semantic learning for instrument understanding; and (3) ambiguity-resilient learning to mitigate annotation inconsistencies across multi-source datasets. Extensive experiments demonstrate that fine-tuning on SA-SV enables substantial performance gains, with fine-tuned SAM2 improving by 12.99 average $\mathcal{J}$\&$\mathcal{F}$ over vanilla SAM2. SAM2S further advances performance to 80.42 average $\mathcal{J}$\&$\mathcal{F}$, surpassing vanilla and fine-tuned SAM2 by 17.10 and 4.11 points, respectively, while maintaining 68 FPS real-time inference and strong zero-shot generalization. Code and dataset will be released at https://jinlab-imvr.github.io/SAM2S.
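As a quick sanity check on the reported gains, the sketch below back-computes the baseline scores implied by the abstract's figures. The two baseline values are derived here by subtraction and are not quoted directly in the abstract; the variable names are illustrative only.

```python
from decimal import Decimal

# Figures quoted in the abstract (average J&F, in points).
sam2s = Decimal("80.42")
gain_over_vanilla = Decimal("17.10")
gain_over_finetuned = Decimal("4.11")
finetune_gain = Decimal("12.99")

# Derived values (an assumption of this sketch, not stated in the abstract).
implied_vanilla = sam2s - gain_over_vanilla
implied_finetuned = sam2s - gain_over_finetuned

print(implied_vanilla)    # 63.32
print(implied_finetuned)  # 76.31
# The fine-tuning gain should equal the gap between the two implied baselines.
print(implied_finetuned - implied_vanilla == finetune_gain)  # True
```

The three reported differences are mutually consistent: the implied fine-tuned and vanilla SAM2 scores differ by exactly the 12.99 points attributed to fine-tuning on SA-SV.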
