SAM2S: Segment Anything in Surgical Videos via Semantic Long-term Tracking

Haofeng Liu, Ziyue Wang, Sudhanshu Mishra, Mingqi Gao, Guanyi Qin, Chang Han Low, Alex Y. W. Kong, Yueming Jin

TL;DR

This work tackles surgical video segmentation by bridging the domain gap between natural and surgical scenes and enabling robust long-term tracking. It introduces SA-SV, the largest surgical iVOS benchmark with masklet annotations across eight procedure types, and SAM2S, a surgical specialization of SAM2 that adds DiveMem for long-term memory, Temporal Semantic Learning for instrument-aware semantics, and Ambiguity-Resilient Learning to mitigate annotation inconsistencies across multi-source datasets. The approach achieves an average $\mathcal{J}\&\mathcal{F}$ of 80.42 with 3-click prompts while running at 68 FPS, outperforming both vanilla and fine-tuned SAM2 baselines and showing strong zero-shot generalization. The results indicate that domain-specific data combined with memory-augmented, semantically informed learning substantially improves surgical video segmentation, with potential to enhance intraoperative guidance and assessment. The authors state that the dataset and code will be released to support adoption and further research.

Abstract

Surgical video segmentation is crucial for computer-assisted surgery, enabling precise localization and tracking of instruments and tissues. Interactive Video Object Segmentation (iVOS) models such as Segment Anything Model 2 (SAM2) provide prompt-based flexibility beyond methods with predefined categories, but face challenges in surgical scenarios due to the domain gap and limited long-term tracking. To address these limitations, we construct SA-SV, the largest surgical iVOS benchmark with instance-level spatio-temporal annotations (masklets) spanning eight procedure types (61k frames, 1.6k masklets), enabling comprehensive development and evaluation for long-term tracking and zero-shot generalization. Building on SA-SV, we propose SAM2S, a foundation model enhancing SAM2 for Surgical iVOS through: (1) DiveMem, a trainable diverse memory mechanism for robust long-term tracking; (2) temporal semantic learning for instrument understanding; and (3) ambiguity-resilient learning to mitigate annotation inconsistencies across multi-source datasets. Extensive experiments demonstrate that fine-tuning on SA-SV enables substantial performance gains, with SAM2 improving by 12.99 average $\mathcal{J}\&\mathcal{F}$ over vanilla SAM2. SAM2S further advances performance to 80.42 average $\mathcal{J}\&\mathcal{F}$, surpassing vanilla and fine-tuned SAM2 by 17.10 and 4.11 points respectively, while maintaining 68 FPS real-time inference and strong zero-shot generalization. Code and dataset will be released at https://jinlab-imvr.github.io/SAM2S.
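The reported $\mathcal{J}\&\mathcal{F}$ figures can be unpacked with a small sketch. It assumes the standard DAVIS-style convention, in which $\mathcal{J}$ is the region Jaccard index (IoU) and $\mathcal{J}\&\mathcal{F}$ is the mean of $\mathcal{J}$ and the boundary F-measure $\mathcal{F}$; the abstract does not spell this out. The function names, the toy masks, and the empty-mask convention are illustrative assumptions, not the authors' evaluation code. The last part only derives the implied baseline scores from the numbers quoted above.

```python
import numpy as np


def jaccard(pred, gt):
    """Region similarity J: intersection over union of two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        # Assumed convention: two empty masks count as a perfect match.
        return 1.0
    return float(np.logical_and(pred, gt).sum() / union)


def j_and_f(j, f):
    """J&F as the plain mean of region (J) and boundary (F) scores."""
    return 0.5 * (j + f)


# Toy example: the prediction covers two pixels, the ground truth one of them.
toy_j = jaccard([[1, 1], [0, 0]], [[1, 0], [0, 0]])

# Figures quoted in the abstract (average J&F, in points).
sam2s = 80.42
gain_over_vanilla = 17.10
gain_over_finetuned = 4.11

# Baseline scores implied by those gains.
vanilla = round(sam2s - gain_over_vanilla, 2)
finetuned = round(sam2s - gain_over_finetuned, 2)
finetune_gain = round(finetuned - vanilla, 2)
```

Under these assumptions the implied baselines are 63.32 for vanilla SAM2 and 76.31 for SAM2 fine-tuned on SA-SV, and their difference reproduces the 12.99-point fine-tuning gain stated in the abstract.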

Paper Structure

This paper contains 13 sections, 5 equations, 4 figures, 7 tables.

Figures (4)

  • Figure 1: Overview of SA-SV benchmark and SAM2S framework. (a) Dataset scale comparison. (b) SA-SV benchmark distribution. (c) SAM2 for natural videos. (d) SAM2S for surgical videos with enhanced long-term tracking and domain-specific modules.
  • Figure 2: Overview of SAM2S for surgical video segmentation. DiveMem handles long-term tracking, TSL enhances semantic understanding, and ARL addresses annotation ambiguity.
  • Figure 3: Qualitative comparison between SAM2 (vanilla), SAM2 (FT), SAM2Long (FT), and SAM2S on RARP50. Frame indices indicate timestamps in seconds, spanning from 150s to 560s (410s duration).
  • Figure 4: Qualitative comparison on EndoVis18 (140s duration).