MA-SAM: Modality-agnostic SAM Adaptation for 3D Medical Image Segmentation

Cheng Chen, Juzheng Miao, Dufan Wu, Zhiling Yan, Sekeun Kim, Jiang Hu, Aoxiao Zhong, Zhengliang Liu, Lichao Sun, Xiang Li, Tianming Liu, Pheng-Ann Heng, Quanzheng Li

TL;DR

This work tackles the gap between SAM's natural-image training and the needs of medical image segmentation by introducing MA-SAM, a modality-agnostic adaptation that combines FacT-based parameter-efficient fine-tuning of the 2D encoder with lightweight 3D adapters to extract volumetric/temporal information, and a fully fine-tuned mask decoder with progressive upsampling. The method demonstrates strong automatic segmentation across CT, MRI, and surgical video tasks, outperforming state-of-the-art 3D approaches and showing solid generalization to new datasets and modalities. Prompt-based segmentation further enhances performance in challenging tumor scenarios, highlighting the practical value of combining 2D foundation-model weights with targeted 3D adaptations for medical imaging. Overall, MA-SAM offers a scalable, generalizable framework for applying foundation-model segmentation to diverse medical imaging modalities.
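
To give a rough sense of why a factorized weight update is parameter-efficient, the sketch below compares the number of trainable parameters in a full update of the query/value projections with a low-rank increment of the form ΔW = U Σ Vᵀ, where U and V are shared across layers and Σ is specific to each matrix. The factorization form, the helper name `count_trainable`, and the sizes (width 768, 12 layers, rank 32) are illustrative assumptions, not values taken from this summary.

```python
def count_trainable(d: int, rank: int, num_layers: int, mats_per_layer: int = 2):
    """Compare full fine-tuning with a shared low-rank increment.

    Assumed setup (hypothetical, for illustration only):
      - each adapted matrix is d x d
      - increments are U @ Sigma @ V.T, with U and V (d x rank) shared by all
        matrices and one rank x rank Sigma per adapted matrix
    """
    full = num_layers * mats_per_layer * d * d
    shared_factors = 2 * d * rank               # U and V
    per_matrix = rank * rank                    # one Sigma per adapted matrix
    factored = shared_factors + num_layers * mats_per_layer * per_matrix
    return full, factored


# Illustrative sizes only; the actual configuration may differ.
full, factored = count_trainable(d=768, rank=32, num_layers=12)
print(f"full update: {full:,} parameters")
print(f"factored update: {factored:,} parameters ({factored / full:.2%} of full)")
```

Under these assumed sizes the factored increment trains well under 1% of the parameters a full update of the same matrices would, while the pre-trained weights themselves stay frozen.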

Abstract

The Segment Anything Model (SAM), a foundation model for general image segmentation, has demonstrated impressive zero-shot performance across numerous natural image segmentation tasks. However, SAM's performance significantly declines when applied to medical images, primarily due to the substantial disparity between natural and medical image domains. To effectively adapt SAM to medical images, it is important to incorporate critical third-dimensional information, i.e., volumetric or temporal knowledge, during fine-tuning. Simultaneously, we aim to harness SAM's pre-trained weights within its original 2D backbone to the fullest extent. In this paper, we introduce a modality-agnostic SAM adaptation framework, named MA-SAM, that is applicable to various volumetric and video medical data. Our method is rooted in the parameter-efficient fine-tuning strategy, updating only a small portion of weight increments while preserving the majority of SAM's pre-trained weights. By injecting a series of 3D adapters into the transformer blocks of the image encoder, our method enables the pre-trained 2D backbone to extract third-dimensional information from input data. The effectiveness of our method has been comprehensively evaluated on four medical image segmentation tasks, using 10 public datasets across CT, MRI, and surgical video data. Remarkably, without using any prompt, our method consistently outperforms various state-of-the-art 3D approaches, surpassing nnU-Net by 0.9%, 2.6%, and 9.9% in Dice for CT multi-organ segmentation, MRI prostate segmentation, and surgical scene segmentation, respectively. Our model also demonstrates strong generalization and excels in challenging tumor segmentation when prompts are used. Our code is available at: https://github.com/cchen-cc/MA-SAM.
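
The improvements above are reported in Dice, an overlap measure between a predicted mask and the ground truth. The minimal sketch below computes it for flattened binary masks; the function name `dice_score` and the convention of returning 1.0 for two empty masks are assumptions for illustration, and the paper's own evaluation code may differ (for example in how classes or volumes are averaged).

```python
from typing import Sequence


def dice_score(pred: Sequence[int], target: Sequence[int]) -> float:
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) for flattened binary masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of elements")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        # Assumption: two empty masks are treated as a perfect match.
        return 1.0
    return 2.0 * intersection / total


# Toy example: both masks mark 3 foreground pixels, 2 of which overlap.
pred = [0, 1, 1, 1, 0, 0]
target = [0, 1, 1, 0, 1, 0]
score = dice_score(pred, target)
print(f"Dice = {score:.4f}")
```

In the toy example the score is 2·2 / (3 + 3) = 2/3.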

Paper Structure

This paper contains 18 sections, 3 equations, 8 figures, and 10 tables.

Figures (8)

  • Figure 1: The overview of our proposed modality-agnostic SAM adaptation framework (MA-SAM) for medical image segmentation. The image encoder is updated through a parameter-efficient fine-tuning strategy with FacT. The volumetric or temporal information is effectively incorporated via a set of 3D adapters. The mask decoder is fully fine-tuned and modified to recover the prediction resolution. Reshape operations are used to make 3D operations compatible with the 2D backbone.
  • Figure 2: Qualitative visualization of segmentation results generated from our MA-SAM method and other state-of-the-art methods on the BTCV dataset. Abdominal organs are denoted in different colors as shown in the corresponding color bar.
  • Figure 3: Comparison of segmentation results from different methods for surgical scene segmentation on the Endovis18 dataset.
  • Figure 3: Qualitative visualization of segmentation results generated from our MA-SAM method and other state-of-the-art methods on prostate MRI datasets. The prostate boundary is delineated in green for ground truth, in orange for our method, and in red for other methods, respectively.
  • Figure 4: Qualitative visualization of segmentation results generated from different methods for surgical video data. Classes are denoted in different colors.
  • ...and 3 more figures
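
The Figure 1 caption notes that reshape operations make 3D operations compatible with the 2D backbone. The sketch below illustrates that idea with NumPy: slices that the 2D backbone sees as independent images are regrouped into volumes, mixed along the slice axis, and folded back. The function name `mix_across_slices`, the channels-last layout, and the fixed 3-tap average (a stand-in for a learned 3D operation) are all assumptions for illustration, not the adapter described in the paper.

```python
import numpy as np


def mix_across_slices(tokens: np.ndarray, batch: int, depth: int) -> np.ndarray:
    """Mix features along the slice/frame axis, then restore the 2D layout.

    tokens: array of shape (batch * depth, H, W, C), i.e. every slice is
    treated as a separate image by the 2D backbone (assumed layout).
    """
    bd, h, w, c = tokens.shape
    if bd != batch * depth:
        raise ValueError("first dimension must equal batch * depth")

    # Make the third dimension explicit: (B, D, H, W, C).
    vol = tokens.reshape(batch, depth, h, w, c)

    # Stand-in for a learned 3D operation: a fixed 3-tap average over
    # neighbouring slices, zero-padded at both ends of each volume.
    padded = np.pad(vol, ((0, 0), (1, 1), (0, 0), (0, 0), (0, 0)))
    mixed = (padded[:, :-2] + padded[:, 1:-1] + padded[:, 2:]) / 3.0

    # Residual connection, then fold back to (B * D, H, W, C) so the
    # following 2D layers see the shape they expect.
    return (vol + mixed).reshape(batch * depth, h, w, c)


# Two volumes of four slices each, with 2x2 spatial tokens and 3 channels.
tokens = np.ones((2 * 4, 2, 2, 3))
out = mix_across_slices(tokens, batch=2, depth=4)
print(out.shape)
```

Because the slice axis is separated from the batch axis before mixing, information flows only between slices of the same volume, never across different volumes in the batch.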