Proxy Prompt: Endowing SAM and SAM 2 with Auto-Interactive-Prompt for Medical Segmentation

Wang Xinyi, Kang Hongyu, Wei Peishan, Shuai Li, Yu Sun, Sai Kit Lam, Yongping Zheng

TL;DR

The paper tackles the bottleneck of manual prompting and limited human–model interaction in medical segmentation with SAM and SAM 2. It introduces the Proxy Prompt Generator (PPG), comprising a Contextual Selective Module (CSM) and a Contextual Colorization Module (CCM), which derives high-dimensional prompts from non-target data and user masks, enabling automated prompting and flexible task switching without retraining. The approach uses Vision Mamba for context selection and dual-reverse cross-attention to encode user intent, yielding state-of-the-art or near-full-data performance across image, MRI, and video datasets with as few as 16 training masks. The work reports strong cross-domain generalization, real-time applicability, and the potential for rapid adaptation to evolving foundation models, which could ease clinical adoption of large-scale segmentation models.

Abstract

In this paper, we address the unmet demand for automated prompting and enhanced human-model interaction in SAM and SAM 2, with the goal of promoting their widespread clinical adoption. Specifically, we propose Proxy Prompt (PP), which is generated automatically from non-target data with a pre-annotated mask. We devise a novel 3-step context-selection strategy that adaptively selects the most representative contextual information from non-target data via Vision Mamba and selective maps, strengthening the ability of non-target image-mask pairs to guide segmentation of target image/video data. To reinforce human-model interaction in PP, we further propose a contextual colorization module that uses dual-reverse cross-attention to enhance interactions between target features and the contextual embedding while amplifying the distinctive features of user-defined object(s). In extensive evaluations, our method achieves state-of-the-art performance on four public datasets and yields results comparable to fully trained models, even when trained with only 16 image masks.
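
The abstract names cross-attention between target features and a contextual embedding but does not spell out the dual-reverse variant here. As a rough orientation only, the sketch below shows plain scaled dot-product cross-attention run in both directions (target attends to context, then context attends to target). It is a generic illustration, not the authors' Contextual Colorization Module; the function names, the toy token values, and the two-direction arrangement are all assumptions made for this example.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Generic scaled dot-product cross-attention (hypothetical helper).

    queries: list of n_q vectors of length d
    keys:    list of n_k vectors of length d
    values:  list of n_k vectors of length d_v
    Returns (outputs, weights): n_q output vectors and the n_q x n_k weights.
    """
    d = len(queries[0])
    scale = 1.0 / math.sqrt(d)
    outputs, weights = [], []
    for q in queries:
        # Similarity of this query to every key.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        w = softmax(scores)
        # Weighted sum of the value vectors.
        out = [sum(wj * v[c] for wj, v in zip(w, values))
               for c in range(len(values[0]))]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


# Toy tokens (assumed values, 2-dimensional for readability).
target_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
context_tokens = [[2.0, 0.0], [0.0, 2.0]]

# Direction 1: target features query the contextual embedding.
t_out, t_w = cross_attention(target_tokens, context_tokens, context_tokens)

# Direction 2: the contextual embedding queries the target features.
c_out, c_w = cross_attention(context_tokens, target_tokens, target_tokens)
```

In this toy setup each target token places more weight on the context token it aligns with, and the reverse pass gives every context token its own view of the target. How the paper combines or reverses these two passes is not stated in this summary.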

Paper Structure

This paper contains 34 sections, 13 equations, 11 figures, 12 tables.

Figures (11)

  • Figure 1: Illustration of comparison without/with PP in (a-b) SAM 2 using real-time ultrasound frames of 1 subject; and (c-d) SAM using a fundus retina dataset of 3 subjects.
  • Figure 2: Schematic differences of traditional prompt encoder and our proposed PPG in SAM 2 (a-b) and SAM (c-d).
  • Figure 3: The proposed Proxy Prompt Generator for both SAM 2 and SAM. Our key designs focus on the Contextual Selective Module and the Contextual Colorization Module. The Encoder and Decoder refer to the original structures, which are frozen.
  • Figure 4: Visual comparison of results from nine models across five objects.
  • Figure 5: Selective Map visualization on support images with different mean square errors (MSE) to the target image.
  • ...and 6 more figures