RefSAM3D: Adapting SAM with Cross-modal Reference for 3D Medical Image Segmentation

Xiang Gao, Kai Lu

TL;DR

This work tackles the challenge of applying the Segment Anything Model to 3D medical imaging by introducing RefSAM3D, which adds a 3D image adapter, cross-modal reference prompts, and a hierarchical cross-attention mechanism to capture volumetric context. The method processes volumetric data with a 3D patch embedding strategy and a lightweight adapter, while text prompts encoded via CLIP are aligned with visual features through a cross-modal projector and hierarchical attention to produce cross-modal prompts for segmentation. Through extensive experiments on CT and MRI datasets (e.g., KiTS21, LiTS, BTCV, AMOS 22), RefSAM3D achieves state-of-the-art performance, demonstrates strong zero-shot and few-shot generalization, and shows robust boundary precision via a 3D mask decoder with multi-level aggregation. The proposed cross-modal prompting and 3D adaptation provide a practical pathway for reliable, promptable 3D medical segmentation with potential clinical impact in organ/tumor quantification and treatment planning.

Abstract

The Segment Anything Model (SAM), originally built on a 2D Vision Transformer (ViT), excels at capturing global patterns in 2D natural images but struggles with 3D medical imaging modalities like CT and MRI. These modalities require capturing spatial information in volumetric space for tasks such as organ segmentation and tumor quantification. To address this challenge, we introduce RefSAM3D, which adapts SAM for 3D medical imaging by incorporating a 3D image adapter and cross-modal reference prompt generation. Our approach modifies the visual encoder to handle 3D inputs and enhances the mask decoder for direct 3D mask generation. We also integrate textual prompts to improve segmentation accuracy and consistency in complex anatomical scenarios. By employing a hierarchical attention mechanism, our model effectively captures and integrates information across different scales. Extensive evaluations on multiple medical imaging datasets demonstrate the superior performance of RefSAM3D over state-of-the-art methods. Our contributions advance the application of SAM in accurately segmenting complex anatomical structures in medical imaging.
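
To make the idea of handling 3D inputs concrete, the sketch below splits a volume into non-overlapping 3D patches, each of which would become one token for the encoder. It is only an illustration under stated assumptions: the function name `partition_3d`, the toy volume, and the patch size are hypothetical, and the learned convolution-based projection and positional embedding that the paper applies to each patch are omitted.

```python
from itertools import product


def partition_3d(volume, patch):
    """Split a D x H x W nested-list volume into flattened, non-overlapping patches.

    Hypothetical helper for illustration; not the paper's implementation.
    """
    depth, height, width = len(volume), len(volume[0]), len(volume[0][0])
    pd, ph, pw = patch
    if depth % pd or height % ph or width % pw:
        raise ValueError("volume dimensions must be divisible by the patch size")

    # Number of patches along each axis; their product is the token count.
    grid = (depth // pd, height // ph, width // pw)

    tokens = []
    for gz, gy, gx in product(range(grid[0]), range(grid[1]), range(grid[2])):
        # Flatten one patch in (z, y, x) order into a single token vector.
        tokens.append([
            volume[gz * pd + z][gy * ph + y][gx * pw + x]
            for z, y, x in product(range(pd), range(ph), range(pw))
        ])
    return grid, tokens


# Toy 4 x 4 x 4 volume in which each voxel stores its own flat index.
volume = [[[z * 16 + y * 4 + x for x in range(4)] for y in range(4)] for z in range(4)]
grid, tokens = partition_3d(volume, patch=(2, 2, 2))
print(grid, len(tokens), len(tokens[0]))
```

Each token here keeps voxels from neighbouring slices together, which is the property a slice-by-slice 2D patch embedding lacks.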

Paper Structure

This paper contains 26 sections, 7 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: (A) The overview of our proposed RefSAM3D for 3D medical image segmentation, which integrates hierarchical cross-attention between image and text modalities to generate accurate segmentation predictions. (B) The design of the Image Processor, which includes patch partitioning, convolution-based patch embedding, and positional embedding to process volumetric 3D medical data. (C) The framework of the 3D Adapter, which incorporates multi-head attention, depth-wise 3D convolution, and up/down projection for efficient feature extraction and adaptation. (D) The pipeline of the Text Processor, which encodes textual prompts and aligns them with visual embeddings using a cross-modal MLP for enhanced segmentation guidance.
  • Figure 2: The structure of the Cross-Modal Prompt Embedding module. 1) The left part illustrates the overall architecture, where hierarchical visual embeddings from four stages interact with aligned textual embeddings using cross-attention mechanisms to generate cross-modal prompt embeddings. 2) The right part details the cross-attention mechanism, showing how attention weights are computed to align textual and visual embeddings through linear transformations and fusion, enabling effective multi-modal integration for downstream tasks.
  • Figure 3: Qualitative visualizations of the proposed method and baseline approaches on liver tumor, kidney tumor, pancreas tumor and colon cancer segmentation tasks.
  • Figure 4: Qualitative visualization of segmentation results generated by our RefSAM3D method and other state-of-the-art methods on the BTCV dataset.
  • Figure 5: Qualitative visualization of segmentation results generated by different methods for MRI cardiac tumor segmentation.
  • ...and 1 more figure
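
The cross-attention step described in the Figure 2 caption can be sketched as plain scaled dot-product attention. This is a minimal illustration, not the paper's module: it assumes the aligned text embeddings act as queries and the visual tokens of one stage act as keys and values, it leaves out the learned linear transformations, and the names `cross_attention`, `text_tokens`, and `visual_tokens` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention; each query attends over all keys.

    Assumption for this sketch: queries come from text, keys/values from vision.
    """
    dim = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # Similarity between this query and every key, scaled by sqrt(dim).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of the value vectors gives the fused embedding.
        outputs.append([
            sum(wi * v[j] for wi, v in zip(w, values))
            for j in range(len(values[0]))
        ])
    return outputs, weights


# Two toy text embeddings and three toy visual tokens (values are made up).
text_tokens = [[1.0, 0.0], [0.0, 1.0]]
visual_tokens = [[4.0, 0.0], [0.0, 4.0], [0.0, 0.0]]
prompts, attn = cross_attention(text_tokens, visual_tokens, visual_tokens)
```

In this toy setup each text embedding puts most of its weight on the visual token it is most similar to. In a hierarchical design the same computation would be repeated for the visual embeddings of each stage; how the per-stage results are fused is not specified in this summary, so it is not shown.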