
Augmenting Efficient Real-time Surgical Instrument Segmentation in Video with Point Tracking and Segment Anything

Zijian Wu, Adam Schmidt, Peter Kazanzides, Septimiu E. Salcudean

TL;DR

This work tackles real-time surgical instrument segmentation (SIS) by integrating an online Tracking Any Point (TAP) framework with a fine-tuned, lightweight Segment Anything Model (SAM). Sparse point prompts initialized within a region of interest are propagated frame by frame via CoTracker, giving temporally consistent prompts to a MobileSAM-based segmentation model, which is fully fine-tuned on surgical data with a combined loss $L = L_{BCE} + L_{Dice}$. The approach surpasses the semi-supervised video object segmentation method XMem on EndoVis 2015 (IoU $=84.8$, Dice $=91.0$) and is comparable to XMem and transformer-based fully supervised methods on UCL dVRK and CholecSeg8k, while running at over 25 FPS on a GeForce RTX 4060 and over 90 FPS on an RTX 4090. The method also shows zero-shot generalization on the label-free STIR dataset and points to practical potential for real-time AR guidance in robotic surgery; future work targets richer temporal priors and multimodal integration.
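The combined loss $L = L_{BCE} + L_{Dice}$ can be illustrated with a minimal sketch. It is written in plain Python over flat lists of per-pixel foreground probabilities, which is an assumption made for readability; the function names, the clamping constant `eps`, and the Dice smoothing term `smooth` are illustrative choices and are not taken from the paper or its code.

```python
import math


def bce_loss(probs, targets, eps=1e-7):
    """Mean binary cross-entropy over per-pixel foreground probabilities."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp so that log() stays finite
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(probs)


def dice_loss(probs, targets, smooth=1.0):
    """Soft Dice loss: 1 - (2 * overlap + smooth) / (sum(P) + sum(G) + smooth)."""
    overlap = sum(p * t for p, t in zip(probs, targets))
    denom = sum(probs) + sum(targets)
    return 1.0 - (2.0 * overlap + smooth) / (denom + smooth)


def combined_loss(probs, targets):
    """Unweighted sum of the two terms, L = L_BCE + L_Dice."""
    return bce_loss(probs, targets) + dice_loss(probs, targets)
```

The BCE term penalizes each pixel independently, while the Dice term scores the overlap of the whole predicted region, so the sum stays informative even when the instrument occupies only a small part of the frame.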

Abstract

The Segment Anything Model (SAM) is a powerful vision foundation model that is revolutionizing the traditional paradigm of segmentation. However, its reliance on per-frame prompting and its large computational cost limit its use in robotically assisted surgery. Applications such as augmented reality guidance require little user intervention along with efficient inference to be usable clinically. In this study, we address these limitations by adopting lightweight SAM variants to meet the efficiency requirement and employing fine-tuning techniques to enhance their generalization in surgical scenes. Recent advancements in Tracking Any Point (TAP) have shown promising results in both accuracy and efficiency, particularly when points are occluded or leave the field of view. Inspired by this progress, we present a novel framework that combines an online point tracker with a lightweight SAM model that is fine-tuned for surgical instrument segmentation. Sparse points within the region of interest are tracked and used to prompt SAM throughout the video sequence, providing temporal consistency. The quantitative results surpass the state-of-the-art semi-supervised video object segmentation method XMem on the EndoVis 2015 dataset with 84.8 IoU and 91.0 Dice. Our method achieves promising performance that is comparable to XMem and transformer-based fully supervised segmentation methods on the ex vivo UCL dVRK and in vivo CholecSeg8k datasets. In addition, the proposed method shows promising zero-shot generalization ability on the label-free STIR dataset. In terms of efficiency, we tested our method on a single GeForce RTX 4060 GPU and a single GeForce RTX 4090 GPU, achieving inference speeds of over 25 FPS and over 90 FPS, respectively. Code is available at: https://github.com/wuzijian1997/SIS-PT-SAM
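The track-then-prompt loop described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: `track` and `segment` are hypothetical callables standing in for the online point tracker and the fine-tuned lightweight SAM, and their signatures are assumptions made for this sketch rather than the CoTracker or MobileSAM APIs.

```python
def segment_video(frames, query_points, track, segment):
    """Segment every frame, prompting with points tracked from the first frame.

    track(frame, prev_points) -> (points, visible)   # hypothetical tracker
    segment(frame, prompts)   -> mask                # hypothetical segmenter
    """
    points = list(query_points)
    visible = [True] * len(points)  # all query points are visible at the start
    masks = []
    for index, frame in enumerate(frames):
        if index > 0:
            # Propagate the previous point locations into the current frame.
            points, visible = track(frame, points)
        # Only points reported as visible are used as prompts for this frame.
        prompts = [p for p, v in zip(points, visible) if v]
        masks.append(segment(frame, prompts))
    return masks
```

Because the user supplies points only once, on the first frame, the per-frame prompting burden of SAM is shifted to the tracker, which is what makes the pipeline usable with little user intervention.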

Paper Structure

This paper contains 14 sections, 4 equations, 9 figures, and 2 tables.

Figures (9)

  • Figure 1: The pre-processing workflow to generate the query points. The segmentation model can be semi-automatic, i.e., SAM + bounding box prompt (bottom row), or fully automatic, i.e., CLIPSeg + fixed text prompt “surgical tool” (top row). Manually picking query points without an initial mask is also feasible.
  • Figure 2: The overview of our video surgical instrument segmentation pipeline that combines a segmentation model $Seg(\cdot)$ and a point tracker $TAP(\cdot)$.
  • Figure 3: The pipeline of SAM fine-tuning using points. The input for the SAM model consists of images paired with points within the target area. The green rectangle represents the MobileSAM architecture.
  • Figure 4: Visualization of segmentation results from several methods on the EndoVis 2015 dataset. The images are taken from test Videos 1, 5, and 6 (top to bottom row). The red dots in Figs. 4-8 are the point prompts tracked by CoTracker.
  • Figure 5: Visualization of segmentation results from several methods on the UCL dVRK dataset. The images are taken from test Videos 1, 3, and 4 (top to bottom row).
  • ...and 4 more figures
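
The query-point generation step in Figure 1 starts from an initial mask of the instrument. One simple way to turn such a mask into sparse query points is sketched below. The evenly strided sampling used here is an assumption chosen for illustration and is not necessarily the strategy used in the paper; `sample_query_points` is a hypothetical helper, and the mask is represented as a nested list of 0/1 values.

```python
def sample_query_points(mask, num_points):
    """Pick up to num_points (x, y) foreground pixels, evenly strided in scan order."""
    foreground = [
        (x, y)
        for y, row in enumerate(mask)
        for x, value in enumerate(row)
        if value
    ]
    if num_points <= 0 or not foreground:
        return []
    if len(foreground) <= num_points:
        return foreground
    step = len(foreground) / num_points
    # Take one pixel from each of num_points equal slices of the scan order.
    return [foreground[int(i * step)] for i in range(num_points)]
```

The returned points would then serve as the initial prompts on the first frame, whether the mask came from SAM with a bounding box or from CLIPSeg with a text prompt.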