Self-Prompt SAM: Medical Image Segmentation via Automatic Prompt SAM Adaptation
Bin Xie, Hao Tang, Dawen Cai, Yan Yan, Gady Agam
TL;DR
The paper addresses the challenge of applying SAM, a prompt-driven natural-image segmentation foundation model, to 3D medical image segmentation where manual prompts and semantic labeling are limiting. It introduces Self-Prompt-SAM, which combines a Multi-Scale Prompt Generator (MSPGenerator) to produce auxiliary multi-class masks, distance-transform-based prompts (points and boxes), and a depth-aware 3D adapter (DFusedAdapter) inserted into the image encoder and mask decoder to enable 3D information extraction while keeping SAM's weights frozen. An MC-Adapter is also proposed to map binary masks into multi-class segmentation outputs, along with depth positional embeddings for depth-aware processing. Evaluations on AMOS2022, Synapse, and ACDC show state-of-the-art Dice scores, with Self-Prompt-SAM outperforming nnUNet by 2.3% on AMOS2022, 1.6% on ACDC, and 0.5% on Synapse, demonstrating effective automatic prompting and 3D adaptation for medical imaging.
Abstract
Segment Anything Model (SAM) has demonstrated impressive zero-shot performance and brought a range of unexplored capabilities to natural image segmentation tasks. However, its performance on medical image segmentation, a very important branch of image segmentation, remains uncertain due to the significant differences between natural images and medical images. Meanwhile, it is difficult to meet SAM's requirement for extra prompts, such as points or boxes, to specify medical regions. In this paper, we propose a novel self-prompt SAM adaptation framework for medical image segmentation, named Self-Prompt-SAM. We design a multi-scale prompt generator combined with the image encoder in SAM to generate auxiliary masks. Then, we use the auxiliary masks to generate bounding boxes as box prompts and use Distance Transform to select the most central points as point prompts. Meanwhile, we design a 3D depth-fused adapter (DFusedAdapter) and inject it into each transformer in the image encoder and mask decoder to enable pre-trained 2D SAM models to extract 3D information and adapt to 3D medical images. Extensive experiments demonstrate that our method achieves state-of-the-art performance and outperforms nnUNet by 2.3% on AMOS2022, 1.6% on ACDC, and 0.5% on Synapse datasets.
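
The prompt-derivation step described in the abstract (a bounding box from an auxiliary mask, plus the most central point chosen via a distance transform) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `distance_transform` and `prompts_from_mask` are hypothetical, the mask is a single 2D binary slice for one class, and a city-block (L1) distance computed by breadth-first search stands in for whichever distance transform the paper actually uses. The box is returned as (x_min, y_min, x_max, y_max) and the point as (x, y), which is an assumed convention here.

```python
from collections import deque


def distance_transform(mask):
    """City-block distance from every pixel to the nearest background pixel.

    Assumption: everything outside the image counts as background, so the
    mask is padded with a one-pixel background border before the search.
    """
    h, w = len(mask), len(mask[0])
    padded = [[0] * (w + 2)]
    padded += [[0] + list(row) + [0] for row in mask]
    padded += [[0] * (w + 2)]

    dist = [[None] * (w + 2) for _ in range(h + 2)]
    queue = deque()
    # Seed the multi-source BFS with all background pixels (distance 0).
    for y in range(h + 2):
        for x in range(w + 2):
            if not padded[y][x]:
                dist[y][x] = 0
                queue.append((y, x))

    # Expand outward; each foreground pixel gets its first (shortest) distance.
    while queue:
        y, x = queue.popleft()
        for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h + 2 and 0 <= nx < w + 2 and dist[ny][nx] is None:
                dist[ny][nx] = dist[y][x] + 1
                queue.append((ny, nx))

    # Drop the padding so the result aligns with the input mask.
    return [row[1:-1] for row in dist[1:-1]]


def prompts_from_mask(mask):
    """Return (box, point) for a binary mask, or None if the mask is empty.

    box   = (x_min, y_min, x_max, y_max), the tight bounding box
    point = (x, y), the foreground pixel farthest from the background
    """
    coords = [(y, x) for y, row in enumerate(mask)
              for x, value in enumerate(row) if value]
    if not coords:
        return None  # no foreground, so no prompt can be derived

    ys = [y for y, _ in coords]
    xs = [x for _, x in coords]
    box = (min(xs), min(ys), max(xs), max(ys))

    dist = distance_transform(mask)
    cy, cx = max(coords, key=lambda p: dist[p[0]][p[1]])
    return box, (cx, cy)


# Example: a 3x3 foreground square inside a 5x7 slice.
example_mask = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 1, 0, 0],
    [0, 0, 1, 1, 1, 0, 0],
    [0, 0, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
]
example_box, example_point = prompts_from_mask(example_mask)
```

Choosing the pixel with the largest distance to the background, rather than the centroid, keeps the point prompt inside the region even for concave or ring-shaped organs, where the centroid can fall on background. In a multi-class setting, one would presumably apply the same routine to each class mask in turn.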
