RefSAM3D: Adapting SAM with Cross-modal Reference for 3D Medical Image Segmentation
Xiang Gao, Kai Lu
TL;DR
This work tackles the challenge of applying the Segment Anything Model to 3D medical imaging by introducing RefSAM3D, which adds a 3D image adapter, cross-modal reference prompts, and a hierarchical cross-attention mechanism to capture volumetric context. The method processes volumetric data with a 3D patch embedding strategy and a lightweight adapter, while text prompts encoded via CLIP are aligned with visual features through a cross-modal projector and hierarchical attention to produce cross-modal prompts for segmentation. Through extensive experiments on CT and MRI datasets (e.g., KiTS21, LiTS, BTCV, AMOS 22), RefSAM3D achieves state-of-the-art performance, demonstrates strong zero-shot and few-shot generalization, and shows robust boundary precision via a 3D mask decoder with multi-level aggregation. The proposed cross-modal prompting and 3D adaptation provide a practical pathway for reliable, promptable 3D medical segmentation with potential clinical impact in organ/tumor quantification and treatment planning.
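As a rough illustration of the 3D patch embedding step mentioned above, the following NumPy sketch splits a volume into non-overlapping 3D patches and maps each one to a token with a linear layer. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names `patchify_3d` and `embed_patches`, the patch size (4, 16, 16), the embedding width of 32, and the random weights are all hypothetical choices made for the example.

```python
import numpy as np

def patchify_3d(volume, patch=(4, 16, 16)):
    """Split a (D, H, W) volume into non-overlapping 3D patches.

    Returns an array with one flattened patch per row, plus the token grid shape.
    The patch size is an illustrative assumption, not a value from the paper.
    """
    d, h, w = volume.shape
    pd, ph, pw = patch
    if d % pd or h % ph or w % pw:
        raise ValueError("volume dimensions must be divisible by the patch size")
    gd, gh, gw = d // pd, h // ph, w // pw
    # Reshape to (gd, pd, gh, ph, gw, pw), then group the grid axes before the patch axes.
    blocks = volume.reshape(gd, pd, gh, ph, gw, pw).transpose(0, 2, 4, 1, 3, 5)
    return blocks.reshape(gd * gh * gw, pd * ph * pw), (gd, gh, gw)

def embed_patches(patches, weight, bias):
    """Linear projection of flattened patches to token embeddings."""
    return patches @ weight + bias

# Toy usage: an 8 x 32 x 32 volume becomes a 2 x 2 x 2 grid of tokens.
rng = np.random.default_rng(0)
volume = rng.standard_normal((8, 32, 32)).astype(np.float32)
patches, grid = patchify_3d(volume)
weight = rng.standard_normal((patches.shape[1], 32)).astype(np.float32)  # hypothetical width 32
bias = np.zeros(32, dtype=np.float32)
tokens = embed_patches(patches, weight, bias)
print(grid, tokens.shape)
```

Because each token covers a block that spans several slices, the transformer sees depth-wise context directly, rather than processing slices independently as a 2D patch embedding would.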
Abstract
The Segment Anything Model (SAM), originally built on a 2D Vision Transformer (ViT), excels at capturing global patterns in 2D natural images but struggles with 3D medical imaging modalities such as CT and MRI. Tasks in these modalities, including organ segmentation and tumor quantification, require spatial information to be captured across the full volume rather than slice by slice. To address this challenge, we introduce RefSAM3D, which adapts SAM for 3D medical imaging by incorporating a 3D image adapter and cross-modal reference prompt generation. Our approach modifies the visual encoder to handle 3D inputs and enhances the mask decoder for direct 3D mask generation. We also integrate textual prompts to improve segmentation accuracy and consistency in complex anatomical scenarios. A hierarchical attention mechanism allows the model to capture and integrate information across different scales. Extensive evaluations on multiple medical imaging datasets show that RefSAM3D outperforms state-of-the-art methods. Our contributions advance the application of SAM to the accurate segmentation of complex anatomical structures in medical imaging.
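To make the idea of cross-modal prompting more concrete, the sketch below projects text embeddings to the visual feature width and lets them attend to visual tokens at several scales. It is only a schematic under loud assumptions: the abstract does not specify the attention design, so the single-head attention without learned query/key/value projections, the residual update across levels, the names `cross_attention` and `cross_modal_prompts`, and all dimensions (512-dimensional text embeddings, width 32, two feature levels) are hypothetical and chosen for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values):
    """Single-head scaled dot-product attention.

    queries: (T, C) text-derived prompts; keys_values: (N, C) visual tokens.
    Learned Q/K/V projections are omitted to keep the sketch short.
    """
    scale = 1.0 / np.sqrt(queries.shape[-1])
    weights = softmax(queries @ keys_values.T * scale, axis=-1)  # (T, N)
    return weights @ keys_values, weights

def cross_modal_prompts(text_emb, projector, visual_levels):
    """Project text embeddings to the visual width, then attend to each level in turn."""
    prompts = text_emb @ projector                 # (T, C)
    for feats in visual_levels:                    # each feats: (N_i, C)
        attended, _ = cross_attention(prompts, feats)
        prompts = prompts + attended               # residual update (an assumption)
    return prompts

# Toy usage with random stand-ins for text embeddings and multi-scale visual features.
rng = np.random.default_rng(0)
text_emb = rng.standard_normal((2, 512))           # 2 text tokens, hypothetical width 512
projector = rng.standard_normal((512, 32)) * 0.05  # hypothetical cross-modal projector
visual_levels = [rng.standard_normal((64, 32)), rng.standard_normal((8, 32))]
prompts = cross_modal_prompts(text_emb, projector, visual_levels)
print(prompts.shape)
```

In a full model the resulting prompt vectors would be passed to the mask decoder in place of, or alongside, point and box prompts; that wiring is outside the scope of this sketch.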
