Proxy Prompt: Endowing SAM and SAM 2 with Auto-Interactive-Prompt for Medical Segmentation
Wang Xinyi, Kang Hongyu, Wei Peishan, Shuai Li, Yu Sun, Sai Kit Lam, Yongping Zheng
TL;DR
The paper tackles the bottleneck of manual prompting and limited human–model interaction in medical segmentation with SAM/SAM 2. It introduces Proxy Prompt Generator (PPG), comprising Contextual Selective Module (CSM) and Contextual Colorization Module (CCM), to derive high-dimensional prompts from non-target data and user masks, enabling automated prompting and flexible task switching without retraining. The approach leverages Vision Mamba for context selection and dual-reverse cross-attention to encode user intent, yielding state-of-the-art or near-full-data performance across image, MRI, and video datasets with as few as 16 training masks. This work demonstrates strong cross-domain generalization, robust real-time applicability, and the potential for rapid adaptation to evolving foundation models, thereby enhancing clinical adoption of large-scale segmentation models.
Abstract
In this paper, we address the unmet demand for automated prompting and enhanced human-model interaction in SAM and SAM 2, with the goal of promoting their widespread clinical adoption. Specifically, we propose Proxy Prompt (PP), which is generated automatically from non-target data with a pre-annotated mask. We devise a novel 3-step context-selection strategy that adaptively selects the most representative contextual information from non-target data via Vision Mamba and selective maps, enabling non-target image-mask pairs to guide segmentation on target image/video data. To reinforce human-model interaction in PP, we further propose a contextual colorization module that uses dual-reverse cross-attention to enhance the interaction between target features and the contextual embedding while amplifying the distinctive features of user-defined object(s). In extensive evaluations, our method achieves state-of-the-art performance on four public datasets and yields results comparable to fully trained models, even when trained with only 16 image masks.
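The abstract does not spell out how the dual-reverse cross-attention is formulated, so the following is only a minimal sketch of generic bidirectional cross-attention between target features and a contextual embedding, not the authors' module. The function names, the token counts, the feature width, the residual updates, and the absence of learned projections are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def cross_attention(queries, keys_values):
    """Single-head scaled dot-product cross-attention.

    Assumption: no learned query/key/value projections, to keep the
    sketch short. Returns the attended output and the attention weights.
    """
    dim = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(dim)
    weights = softmax(scores, axis=-1)  # each query row sums to 1
    return weights @ keys_values, weights


def bidirectional_cross_attention(target_feats, context_emb):
    """Attend in both directions and apply residual updates.

    Hypothetical stand-in for the interaction described in the abstract:
    target tokens query the context, and context tokens query the target.
    """
    target_update, w_target_to_context = cross_attention(target_feats, context_emb)
    context_update, w_context_to_target = cross_attention(context_emb, target_feats)
    new_target = target_feats + target_update
    new_context = context_emb + context_update
    return new_target, new_context, w_target_to_context, w_context_to_target


# Toy shapes (assumed): 6 target tokens, 4 context tokens, width 8.
rng = np.random.default_rng(0)
target = rng.standard_normal((6, 8))
context = rng.standard_normal((4, 8))
new_target, new_context, w_tc, w_ct = bidirectional_cross_attention(target, context)
```

In this sketch both token sets keep their original shapes, and each attention matrix has one row per query token, so `w_tc` is 6 by 4 and `w_ct` is 4 by 6. A real module would most likely add learned projections, multiple heads, and normalization.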
