DeSAM: Decoupled Segment Anything Model for Generalizable Medical Image Segmentation

Yifan Gao; Wei Xia; Dingdu Hu; Wenkui Wang; Xin Gao

TL;DR

Domain generalization for medical image segmentation remains difficult when test data come from unseen distributions. The authors introduce DeSAM, a decoupled extension of the Segment Anything Model (SAM) that splits mask generation into a prompt-relevant IoU module (PRIM) and a prompt-decoupled mask module (PDMM) while freezing the encoders to preserve pre-trained knowledge. PDMM fuses multi-scale image embeddings with PRIM-derived mask embeddings, enabling robust automatic segmentation across domains. Evaluations on cross-site prostate and cross-modality abdominal datasets show DeSAM-P achieving state-of-the-art Dice scores and outperforming prior single-source domain generalization methods, demonstrating practical potential for foundation-model-based medical segmentation with limited domain data.

Abstract

Deep learning-based medical image segmentation models often suffer from domain shift, where models trained on a source domain do not generalize well to unseen domains. As a prompt-driven foundation model with powerful generalization capabilities, the Segment Anything Model (SAM) shows potential for improving the cross-domain robustness of medical image segmentation. However, SAM performs significantly worse in automatic segmentation scenarios than when manually prompted, hindering its direct application to domain generalization. Upon further investigation, we discovered that the degradation in performance was related to the coupling effect of inevitable poor prompts and mask generation. To address the coupling effect, we propose the Decoupled SAM (DeSAM). DeSAM modifies SAM's mask decoder by introducing two new modules: a prompt-relevant IoU module (PRIM) and a prompt-decoupled mask module (PDMM). PRIM predicts the IoU score and generates mask embeddings, while PDMM extracts multi-scale features from the intermediate layers of the image encoder and fuses them with the mask embeddings from PRIM to generate the final segmentation mask. This decoupled design allows DeSAM to leverage the pre-trained weights while minimizing the performance degradation caused by poor prompts. We conducted experiments on publicly available cross-site prostate and cross-modality abdominal image segmentation datasets. The results show that DeSAM yields a substantial performance improvement over previous state-of-the-art domain generalization methods. The code is publicly available at https://github.com/yifangao112/DeSAM.
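
The two overlap measures mentioned above, the IoU score that PRIM predicts and the Dice score used to report results, can be illustrated with a minimal sketch. It assumes flat binary masks stored as Python lists of 0/1 and treats two empty masks as a perfect match; the function names `dice_score` and `iou_score` are illustrative and are not taken from the DeSAM code base.

```python
def dice_score(pred, target):
    """Dice = 2 * |P and T| / (|P| + |T|) for flat binary masks (lists of 0/1)."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(p & t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


def iou_score(pred, target):
    """IoU = |P and T| / |P or T| for flat binary masks (lists of 0/1)."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(p & t for p, t in zip(pred, target))
    union = sum(p | t for p, t in zip(pred, target))
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else intersection / union


# Toy example: 2 overlapping pixels, 3 foreground pixels in each mask.
pred = [1, 1, 0, 0, 1, 0]
target = [1, 0, 0, 0, 1, 1]
dice = dice_score(pred, target)
iou = iou_score(pred, target)
```

For a single pair of masks the two measures are tied by Dice = 2 IoU / (1 + IoU), so they rank predictions identically; Dice simply weights the overlap more generously.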

Paper Structure (16 sections, 4 figures, 2 tables)

Introduction
Related Work
Single-source domain generalization
Segment Anything Model
Decoupled Segment Anything Model
Architecture
Prompt-Relevant IoU Module (PRIM)
Prompt-Decoupled Mask Module (PDMM)
Training strategies
Results and Discussion
Dataset and implementation details
Ablation studies
Comparison with state-of-the-art methods
Conclusion
Acknowledgments
...and 1 more section

Figures (4)

Figure 1: Overview of the proposed DeSAM. DeSAM consists of the image and prompt encoders of SAM, a prompt-decoupled mask module (PDMM), and a prompt-relevant IoU module (PRIM). The image encoder is used to compute the image embeddings before training. The prompt encoder is frozen during training. The PRIM consists of a cross-attention transformer and an IoU prediction head, and it uses the image and prompt embeddings to generate mask embeddings and an IoU score. The PDMM contains multiple channel attention-based residual blocks (SRB) and upsampling operations, and it integrates the mask embeddings and image embeddings to generate the mask.
Figure 2: Design choices of the decoder. (a) Generating a mask by directly using the image embedding from the encoder (PDMM only). (b) PDMM and PRIM without IoU prediction head. (c) PDMM and PRIM without mask embedding fusion. (d) Our proposed DeSAM.
Figure 3: Quantitative results for different numbers of points.
Figure 4: Visual comparison of different methods for cross-site prostate segmentation and cross-modality abdominal multi-organ segmentation. GT represents the ground truth.
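
The Figure 1 caption mentions channel attention-based residual blocks inside the PDMM. As a rough illustration of that general idea, and not of the authors' implementation, the sketch below gates each channel by a sigmoid of its global average and adds the result back to the input. It makes several simplifying assumptions: feature maps are plain Python lists, the gate uses one hypothetical scalar weight and bias per channel instead of learned fully connected layers, and no convolutions are applied. The name `channel_attention_residual` is illustrative only.

```python
import math


def channel_attention_residual(features, weights, biases):
    """Apply a simplified channel-attention gate with a residual connection.

    features: list of channels, each a flat list of floats.
    weights, biases: one scalar per channel (hypothetical stand-ins for
    learned gate parameters; real blocks typically mix channels).
    """
    if not (len(features) == len(weights) == len(biases)):
        raise ValueError("need one weight and one bias per channel")
    out = []
    for channel, w, b in zip(features, weights, biases):
        # Squeeze: summarize the channel by its global average.
        squeezed = sum(channel) / len(channel)
        # Excite: map the summary to a gate in (0, 1) with a sigmoid.
        gate = 1.0 / (1.0 + math.exp(-(w * squeezed + b)))
        # Residual: add the gated channel back to the original values.
        out.append([x + gate * x for x in channel])
    return out


# Toy input: two channels with two spatial positions each.
features = [[1.0, 3.0], [0.0, 0.0]]
result = channel_attention_residual(features, weights=[0.0, 1.0], biases=[0.0, 0.0])
```

With a zero weight and bias the gate is 0.5, so the first channel is scaled by 1.5; in a trained block the gate would instead emphasize channels that are informative for the mask.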
