UniSurgSAM: A Unified Promptable Model for Reliable Surgical Video Segmentation

Haofeng Liu, Ziyue Wang, Alex Y. W. Kong, Guanyi Qin, Yunqiu Xu, Chang Han Low, Mingqi Gao, Lap Yan Lennon Chan, Yueming Jin

Abstract

Surgical video segmentation is fundamental to computer-assisted surgery. In practice, surgeons need to dynamically specify targets throughout extended procedures, using heterogeneous cues such as visual selections, textual expressions, or audio instructions. However, existing Promptable Video Object Segmentation (PVOS) methods are typically restricted to a single prompt modality and rely on coupled frameworks that cause optimization interference between target initialization and tracking. Moreover, these methods produce hallucinated predictions when the target is absent and suffer from accumulated mask drift without failure recovery. To address these challenges, we present UniSurgSAM, a unified PVOS model enabling reliable surgical video segmentation through visual, textual, or audio prompts. Specifically, UniSurgSAM employs a decoupled two-stage framework that independently optimizes initialization and tracking to resolve the optimization interference. Within this framework, we introduce three key designs for reliability: presence-aware decoding that models target absence to suppress hallucinations; boundary-aware long-term tracking that prevents mask drift over extended sequences; and adaptive state transition that closes the loop between stages for failure recovery. Furthermore, we establish a multi-modal and multi-granular benchmark from four public surgical datasets with precise instance-level masklets. Extensive experiments demonstrate that UniSurgSAM achieves state-of-the-art performance in real time across all prompt modalities and granularities, providing a practical foundation for computer-assisted surgery. Code and datasets will be available at https://jinlab-imvr.github.io/UniSurgSAM.
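
To make the closed-loop design described above concrete, the sketch below wires a Stage I initializer and a Stage II tracker around a presence score, with re-entry into Stage I when tracking confidence collapses. Everything here is hypothetical: the `Initializer`/`Tracker` stubs, the `tau_entry`/`tau_exit` thresholds, and the interfaces are assumptions for exposition, not the authors' implementation.

```python
# Hypothetical sketch of a decoupled two-stage PVOS loop; names, thresholds,
# and interfaces are illustrative, not UniSurgSAM's actual implementation.
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional

import numpy as np


@dataclass
class Prediction:
    mask: Optional[np.ndarray]   # None when the target is judged absent
    presence: float              # presence confidence in [0, 1]


class Initializer:
    """Stage I (assumed interface): ground a visual/textual/audio prompt."""

    def ground(self, frame: np.ndarray, prompt: str) -> Prediction:
        # Placeholder: a real model would run promptable detection here.
        return Prediction(mask=np.zeros(frame.shape[:2], bool), presence=0.9)


class Tracker:
    """Stage II (assumed interface): propagate the mask frame to frame."""

    def reset(self, frame: np.ndarray, mask: np.ndarray) -> None:
        self.template = mask

    def step(self, frame: np.ndarray) -> Prediction:
        # Placeholder: a real tracker would use memory attention here.
        return Prediction(mask=self.template, presence=0.8)


def run_pvos(frames: Iterable[np.ndarray], prompt: str,
             tau_entry: float = 0.5,
             tau_exit: float = 0.2) -> Iterator[Optional[np.ndarray]]:
    """Closed-loop segmentation: initialize, track, and hand control back to
    Stage I when presence confidence collapses (adaptive state transition)."""
    init, tracker = Initializer(), Tracker()
    tracking = False
    for frame in frames:
        if not tracking:
            pred = init.ground(frame, prompt)       # Stage I
            if pred.presence >= tau_entry:          # credible activation
                tracker.reset(frame, pred.mask)
                tracking = True
        else:
            pred = tracker.step(frame)              # Stage II
            if pred.presence < tau_exit:            # fallback on failure
                tracking = False
        # Presence-aware decoding: emit no mask when the target is absent,
        # rather than hallucinating a segmentation on the wrong object.
        yield pred.mask if pred.presence >= tau_exit else None
```

Decoupling the stages this way lets the grounding model and the tracker be optimized and thresholded independently, which is precisely the optimization-interference issue the abstract attributes to coupled frameworks.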

Figures (8)

  • Figure 1: Motivation. (A) Existing single-modal PVOS methods rely on a coupled decoder that causes optimization interference, with fragile open-loop pipelines prone to hallucinations and mask drift. (B) UniSurgSAM achieves unified PVOS through the decoupled two-stage framework for stable optimization, presence-aware decoding to suppress hallucinations, boundary-aware tracking to prevent mask drift, and adaptive state transition for closed-loop failure recovery. Textual prompt for the example: "large needle driver is manipulating tool on the right." In (A) at time $t$, the model hallucinates a mask on the wrong object when the target is absent.
  • Figure 2: Overview of UniSurgSAM. The model adopts a decoupled two-stage framework that independently supports visual, textual, or audio prompts: Stage I performs promptable initialization from the given prompt, while Stage II conducts boundary-aware long-term tracking. For linguistic prompts, AST acts as a central controller that routes data to the detector or tracker via a selector, coordinating bidirectional switching through credible activation (Entry) and consensus-based fallback (Exit). A sketch of the Exit consensus check is given after this list.
  • Figure 3: Illustration of Diversity-Driven Long-Term Memory.
  • Figure 4: Qualitative comparison in Uni-EndoVis18 (left) and Uni-RARP50 (right) for Textual PVOS. FP and FN denote false positive and false negative predictions, respectively. Numbers in images denote time in seconds.
  • Figure 5: Qualitative comparison in Uni-EndoVis17 for Visual PVOS under three-point initialization.
  • ...and 3 more figures
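
Figure 2's Exit path can be illustrated with a small consensus check between detector and tracker outputs. The IoU criterion, threshold, and function names below are assumptions; the paper's actual consensus rule is not given in this excerpt.

```python
# Hypothetical illustration of a consensus-based fallback (Exit in Figure 2):
# compare the detector's and tracker's masks and hand control back to Stage I
# re-initialization when they stop agreeing. IoU and tau_consensus are
# assumed stand-ins for the paper's unspecified criterion.
import numpy as np


def mask_iou(a: np.ndarray, b: np.ndarray) -> float:
    """Intersection-over-union of two boolean masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return float(inter) / float(union) if union else 0.0


def should_exit(det_mask: np.ndarray, trk_mask: np.ndarray,
                tau_consensus: float = 0.5) -> bool:
    """Exit tracking when detector and tracker disagree, triggering the
    fallback to Stage I initialization."""
    return mask_iou(det_mask, trk_mask) < tau_consensus
```

Using agreement between two independent predictors, rather than either model's own confidence, gives the controller a failure signal that does not depend on a drifting tracker trusting its own output.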