SurgicalPart-SAM: Part-to-Whole Collaborative Prompting for Surgical Instrument Segmentation

Wenxi Yue, Jing Zhang, Kun Hu, Qiuxia Wu, Zongyuan Ge, Yong Xia, Jiebo Luo, Zhiyong Wang

TL;DR

SurgicalPart-SAM (SP-SAM) addresses text-promptable surgical instrument segmentation by embedding instrument part structure into a SAM-tuning framework. It introduces Collaborative Prompts that describe each instrument part, a Cross-Modal Prompt Encoder that fuses text and image information, and Part-to-Whole Adaptive Fusion with Hierarchical Decoding to produce accurate part- and whole-instrument masks. The approach uses a binary category-part relation matrix $\mathcal{D}_{CP} \in \{0,1\}^{C \times P}$, which records which of the $P$ part types belong to each of the $C$ instrument categories, together with a cross-modal interaction pipeline. It achieves state-of-the-art results on EndoVis2018 and EndoVis2017 with only a small number of tunable parameters, outperforming prior SAM-based methods and many specialist models. The work demonstrates the potential of efficient foundation-model adaptation for fine-grained, domain-specific segmentation tasks and suggests extending text prompts to temporal cues and additional background targets.
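
To make the category-part relation matrix concrete, here is a minimal sketch of how a binary $C \times P$ matrix of this kind could be built from a table of part compositions. The category names, part names, and compositions below are illustrative assumptions, not values taken from the paper, and `build_relation_matrix` and `parts_of` are hypothetical helpers.

```python
# Hypothetical vocabularies; the paper's actual categories and parts may differ.
CATEGORIES = ["bipolar forceps", "monopolar curved scissors", "ultrasound probe"]
PARTS = ["shaft", "wrist", "jaws"]

# Illustrative part compositions (assumed for this example only).
COMPOSITION = {
    "bipolar forceps": {"shaft", "wrist", "jaws"},
    "monopolar curved scissors": {"shaft", "wrist", "jaws"},
    "ultrasound probe": {"shaft"},
}


def build_relation_matrix(categories, parts, composition):
    """Return a C x P list of lists with 1 where a category contains a part."""
    return [
        [1 if part in composition[category] else 0 for part in parts]
        for category in categories
    ]


D_CP = build_relation_matrix(CATEGORIES, PARTS, COMPOSITION)


def parts_of(category):
    """Look up the parts of one category by reading its row of D_CP."""
    row = D_CP[CATEGORIES.index(category)]
    return [part for part, flag in zip(PARTS, row) if flag]


print(D_CP)
print(parts_of("ultrasound probe"))
```

Each row of the matrix acts as a mask over parts, so downstream steps can ignore parts that a given category does not have.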

Abstract

The Segment Anything Model (SAM) exhibits promise in generic object segmentation and offers potential for various applications. Existing methods have applied SAM to surgical instrument segmentation (SIS) by tuning SAM-based frameworks with surgical data. However, they fall short in two crucial aspects: (1) Straightforward model tuning with instrument masks treats each instrument as a single entity, neglecting their complex structures and fine-grained details; and (2) Instrument category-based prompts are not flexible and informative enough to describe instrument structures. To address these problems, in this paper, we investigate text promptable SIS and propose SurgicalPart-SAM (SP-SAM), a novel SAM efficient-tuning approach that explicitly integrates instrument structure knowledge with SAM's generic knowledge, guided by expert knowledge on instrument part compositions. Specifically, we achieve this by proposing (1) Collaborative Prompts that describe instrument structures via collaborating category-level and part-level texts; (2) Cross-Modal Prompt Encoder that encodes text prompts jointly with visual embeddings into discriminative part-level representations; and (3) Part-to-Whole Adaptive Fusion and Hierarchical Decoding that adaptively fuse the part-level representations into a whole for accurate instrument segmentation in surgical scenarios. Built upon them, SP-SAM acquires a better capability to comprehend surgical instruments in terms of both overall structure and part-level details. Extensive experiments on both the EndoVis2018 and EndoVis2017 datasets demonstrate SP-SAM's state-of-the-art performance with minimal tunable parameters. The code will be available at https://github.com/wenxi-yue/SurgicalPart-SAM.
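
The idea of combining category-level and part-level texts can be illustrated with a small sketch. The template `"<part> of <category>"` and the part names are assumptions made for illustration; the abstract does not specify the exact prompt wording, and `collaborative_prompts` is a hypothetical helper.

```python
def collaborative_prompts(category, parts):
    """Build one text prompt per part, pairing the part with its category.

    The "<part> of <category>" template is an assumption for illustration.
    """
    return [f"{part} of {category}" for part in parts]


# Hypothetical example: an instrument category with three assumed parts.
prompts = collaborative_prompts("bipolar forceps", ["shaft", "wrist", "jaws"])
for text in prompts:
    print(text)
```

In a full pipeline, each of these strings would be passed to a text encoder so that every part receives its own representation before the parts are fused into a whole-instrument prediction.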

Paper Structure

This paper contains 15 sections, 2 equations, 6 figures, and 4 tables.

Figures (6)

  • Figure 1: SP-SAM with Collaborative Prompts incorporates the knowledge of surgical instrument structures. Subfigure (e) is partially excerpted from EndoVis2017.
  • Figure 2: Overview of SP-SAM. SP-SAM consists of four main components: a SAM Image Encoder, a Cross-Modal Prompt Encoder, a Part-to-Whole Adaptive Fusion module, and a SAM Decoder. The SAM Image Encoder, CLIP Text Encoder (within the Cross-Modal Prompt Encoder), and output MLPs in SAM Decoder are frozen and the remaining weights are tuned.
  • Figure 3: Cross-Modal Prompt Encoder consists of feature extraction of Collaborative Prompts and part-level cross-modal encoding.
  • Figure 4: Part-to-Whole Adaptive Fusion Module adaptively assembles the sparse and dense embeddings of the parts into the sparse and dense embeddings of the whole instrument via Category Part Attention and Image Part Attention.
  • Figure 5: Visual comparison of predicted masks by different methods.
  • ...and 1 more figure
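
The part-to-whole fusion described in the Figure 4 caption can be pictured as a weighted combination of part embeddings in which parts absent from a category receive zero weight. The sketch below is a simplified stand-in under that assumption: it uses a masked softmax over hand-supplied scores, not the paper's Category Part Attention or Image Part Attention, and `masked_softmax` and `fuse_parts` are hypothetical helpers.

```python
import math


def masked_softmax(scores, mask):
    """Softmax over scores, giving zero weight to entries where mask is 0."""
    kept = [s for s, m in zip(scores, mask) if m]
    if not kept:
        raise ValueError("mask must keep at least one part")
    peak = max(kept)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) if m else 0.0 for s, m in zip(scores, mask)]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_parts(part_embeddings, scores, mask):
    """Combine part embeddings into one whole-instrument embedding."""
    weights = masked_softmax(scores, mask)
    dim = len(part_embeddings[0])
    return [
        sum(w * emb[d] for w, emb in zip(weights, part_embeddings))
        for d in range(dim)
    ]


# Toy example: three parts with 2-D embeddings; the third part is masked out,
# as a row of a category-part relation matrix might indicate.
part_embeddings = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
scores = [0.0, 0.0, 3.0]
mask = [1, 1, 0]
fused = fuse_parts(part_embeddings, scores, mask)
print(fused)
```

Because the third part is masked, its high score and large embedding do not influence the result; the two remaining parts share the weight equally.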