Diffusion-empowered AutoPrompt MedSAM

Peng Huang, Shu Hu, Bo Peng, Xun Gong, Penghang Yin, Hongtu Zhu, Xi Wu, Xin Wang

TL;DR

AutoMedSAM addresses two limitations of MedSAM, its reliance on labor-intensive manual prompts and the absence of semantic labels in its masks, by introducing a diffusion-based class prompt encoder and an uncertainty-aware joint optimization strategy. The framework preserves MedSAM’s image encoder and mask decoder while enabling end-to-end, class-conditioned segmentation through learned prompt embeddings generated by forward diffusion and dual-branch reverse diffusion. Across four diverse medical datasets, AutoMedSAM achieves state-of-the-art segmentation performance and strong cross-dataset generalization, outperforming SAM-Core and SAM-Based baselines while reducing the need for expert prompts. These results suggest practical value for clinical workflows and non-expert users, with potential for broader adoption across medical imaging modalities.

Abstract

MedSAM, a medical foundation model derived from the SAM architecture, has demonstrated notable success across diverse medical domains. However, its clinical application faces two major challenges: the dependency on labor-intensive manual prompt generation, which imposes a significant burden on clinicians, and the absence of semantic labeling in the generated segmentation masks for organs or lesions, limiting its practicality for non-expert users. To address these limitations, we propose AutoMedSAM, an end-to-end framework derived from SAM, designed to enhance usability and segmentation performance. AutoMedSAM retains MedSAM's image encoder and mask decoder structure while introducing a novel diffusion-based class prompt encoder. The diffusion-based encoder employs a dual-decoder structure to collaboratively generate prompt embeddings guided by sparse and dense prompt definitions. These embeddings enhance the model's ability to understand and process clinical imagery autonomously. With this encoder, AutoMedSAM leverages class prompts to embed semantic information into the model's predictions, transforming MedSAM's semi-automated pipeline into a fully automated workflow. Furthermore, AutoMedSAM employs an uncertainty-aware joint optimization strategy during training to effectively inherit MedSAM's pre-trained knowledge while improving generalization by integrating multiple loss functions. Experimental results across diverse datasets demonstrate that AutoMedSAM achieves superior performance while broadening its applicability to both clinical settings and non-expert users. Code is available at https://github.com/HP-ML/AutoPromptMedSAM.git.
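The abstract says that several loss functions are combined under an uncertainty-aware strategy but does not give the formula. As a minimal sketch, assuming the widely used homoscedastic-uncertainty weighting (one learnable log-variance `s_i` per loss term, total `sum_i exp(-s_i) * L_i + s_i`), the idea could look like the following. The function names and the example loss values are hypothetical and are not taken from the paper or its repository.

```python
import math


def uncertainty_weighted_loss(losses, log_vars):
    """Combine loss terms using one log-variance s_i = log(sigma_i^2) per term.

    total = sum_i( exp(-s_i) * L_i + s_i )
    A large s_i down-weights a noisy loss; the "+ s_i" term stops s_i
    from growing without bound. (Assumed form, not confirmed by the source.)
    """
    if len(losses) != len(log_vars):
        raise ValueError("need exactly one log-variance per loss term")
    return sum(math.exp(-s) * loss + s for loss, s in zip(losses, log_vars))


def log_var_gradients(losses, log_vars):
    """Analytic gradient of the total with respect to each s_i."""
    return [1.0 - math.exp(-s) * loss for loss, s in zip(losses, log_vars)]


def fit_log_vars(losses, steps=500, lr=0.1):
    """Plain gradient descent on the log-variances with the losses held fixed."""
    log_vars = [0.0] * len(losses)
    for _ in range(steps):
        grads = log_var_gradients(losses, log_vars)
        log_vars = [s - lr * g for s, g in zip(log_vars, grads)]
    return log_vars


# Hypothetical values for two loss terms, e.g. a segmentation loss
# and a knowledge-transfer loss.
example_losses = [0.5, 2.0]
fitted = fit_log_vars(example_losses)
weights = [math.exp(-s) for s in fitted]
```

With the losses held fixed, each `s_i` settles at `log(L_i)`, so the effective weight `exp(-s_i)` becomes `1 / L_i` and larger loss terms are scaled down. In actual training the log-variances would be optimized jointly with the network parameters while the losses themselves change.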

Paper Structure

This paper contains 25 sections, 19 equations, 6 figures, and 8 tables.

Figures (6)

  • Figure 1: Comparison with SAM-based models. (Left) The original SAM model relies on manual prompts from medical experts, restricting its usability and scenarios. (Middle) Current SAM-based methods employ specialist models for prompt generation, but these models are organ- or lesion-specific, limiting SAM's generalizability. (Right) Our method introduces an automatic diffusion-based class prompt encoder, removing the need for explicit prompts, adding semantic labels to masks, and enabling accurate, end-to-end segmentation for non-experts in diverse medical contexts.
  • Figure 2: An overview of AutoMedSAM. AutoMedSAM generates dense and sparse prompt embeddings through a diffusion-based class prompt encoder, eliminating the need for explicit prompts. During training, we employ an uncertainty-aware joint optimization strategy with multiple loss functions for supervision, while transferring MedSAM's pre-trained knowledge to AutoMedSAM. This approach improves training efficiency and generalization. With end-to-end inference, AutoMedSAM overcomes SAM's limitations, enhancing usability and expanding its application scope and user base.
  • Figure 3: Structure of the diffusion-based class prompt encoder. It is designed with an encoder and two independent decoder branches to extract local and global features, based on the practical significance of sparse and dense prompts. The use of prompt classes enables the model to more effectively focus on parts of the input related to specific classes, enhancing its ability to perceive and distinguish class-specific features, thereby improving the controllability and quality of the generation process.
  • Figure 4: The qualitative results of AutoMedSAM and other comparison models on AbdomenCT-1K. The bounding box represents the input prompt.
  • Figure 5: The qualitative analysis results of AutoMedSAM and other comparison models on BraTS, Kvasir-SEG, and Chest-XML.
  • ...and 1 more figure
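The data flow described for the diffusion-based class prompt encoder (forward diffusion on a feature vector, then two decoder branches that produce sparse and dense prompt embeddings for a given class) can be illustrated with a toy sketch. It assumes a standard DDPM-style linear beta schedule, which the text does not specify. The `dual_branch_decode` function is a hypothetical stand-in that only shows the output shapes of the two branches; it is not the paper's network, and the additive class conditioning is purely illustrative.

```python
import math
import random


def linear_alpha_bar(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule."""
    alpha_bar, running = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / max(num_steps - 1, 1)
        running *= 1.0 - beta
        alpha_bar.append(running)
    return alpha_bar


def forward_noise(x0, alpha_bar_t, rng):
    """Sample x_t = sqrt(a) * x0 + sqrt(1 - a) * eps, with eps ~ N(0, 1)."""
    scale = math.sqrt(alpha_bar_t)
    sigma = math.sqrt(1.0 - alpha_bar_t)
    return [scale * v + sigma * rng.gauss(0.0, 1.0) for v in x0]


def dual_branch_decode(x_t, class_id, num_sparse_tokens=2):
    """Toy stand-in for the two decoder branches (NOT the paper's networks).

    Both branches read the same class-shifted features. The sparse branch
    pools them into a few tokens; the dense branch keeps one value per
    position. Trailing positions that do not fill a chunk are ignored.
    """
    if len(x_t) < num_sparse_tokens:
        raise ValueError("need at least one feature per sparse token")
    conditioned = [v + float(class_id) for v in x_t]  # hypothetical conditioning
    chunk = len(conditioned) // num_sparse_tokens
    sparse = [
        sum(conditioned[i:i + chunk]) / chunk
        for i in range(0, chunk * num_sparse_tokens, chunk)
    ]
    dense = list(conditioned)
    return sparse, dense


rng = random.Random(0)                      # seeded for reproducibility
schedule = linear_alpha_bar(num_steps=10)
image_features = [1.0] * 8                  # placeholder for an image embedding
noised = forward_noise(image_features, schedule[-1], rng)
sparse_prompt, dense_prompt = dual_branch_decode(noised, class_id=3)
```

In this sketch the sparse output is a short list of tokens and the dense output keeps the input's length, mirroring the two kinds of prompt embedding that a SAM-style mask decoder consumes. A real encoder would replace both toy branches with learned networks.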