MedSAM3: Delving into Segment Anything with Medical Concepts

Anglin Liu, Rundong Xue, Xu R. Cao, Yifan Shen, Yi Lu, Xiang Li, Qianqian Chen, Jintai Chen

TL;DR

MedSAM-3 reframes medical segmentation by grounding open-vocabulary prompts in clinical concepts and adapting the SAM-3 foundation, addressing the semantic gaps of purely geometric prompting. It introduces MedSAM-3, a concept-driven segmentation backbone, and the MedSAM-3 Agent, which leverages multimodal LLMs for iterative, agent-in-the-loop refinement. Extensive evaluation across 2D, 3D, and video modalities shows that text+image prompts yield the strongest performance, with domain-specific fine-tuning improving semantic alignment and robustness. The agent component further boosts performance on complex, multi-step tasks, demonstrating a viable path toward scalable, semantic-aware medical image analysis in diverse clinical settings. The work offers a practical framework for transferring generalist segmentation capabilities to the clinic, alongside open-source code and models to catalyze adoption.

Abstract

Medical image segmentation is fundamental for biomedical discovery. Existing methods lack generalizability and demand extensive, time-consuming manual annotation for new clinical applications. Here, we propose MedSAM-3, a text-promptable medical segmentation model for medical image and video segmentation. By fine-tuning the Segment Anything Model (SAM) 3 architecture on medical images paired with semantic conceptual labels, our MedSAM-3 enables medical Promptable Concept Segmentation (PCS), allowing precise targeting of anatomical structures via open-vocabulary text descriptions rather than solely geometric prompts. We further introduce the MedSAM-3 Agent, a framework that integrates Multimodal Large Language Models (MLLMs) to perform complex reasoning and iterative refinement in an agent-in-the-loop workflow. Comprehensive experiments across diverse medical imaging modalities, including X-ray, MRI, ultrasound, CT, and video, demonstrate that our approach significantly outperforms existing specialist and foundation models. We will release our code and model at https://github.com/Joey-S-Liu/MedSAM3.

Paper Structure

This paper contains 18 sections, 9 figures, 6 tables.

Figures (9)

  • Figure 1: Overview of concept-driven medical image and video segmentation across multiple modalities using MedSAM-3, highlighting that concise clinical concepts directly guide MedSAM-3 to produce reliable segmentations and thereby simplify physicians’ workflow.
  • Figure 2: Overview of MedSAM-3.
  • Figure 3: Overview of the MedSAM-3 Agent refinement loop. The MedSAM-3 Agent plans and executes multi-step medical image segmentation using an MLLM, generating masks and refining them iteratively with visual and textual feedback.
  • Figure 4: Performance comparison between MedSAM-3 and competing methods on four medical datasets.
  • Figure 5: Visualization of the segmentation performance of MedSAM-3, SAM 3 (both T+I versions), and other comparison methods.
  • ...and 4 more figures
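
The agent-in-the-loop refinement described in the abstract and in the Figure 3 caption can be pictured as a simple control loop: segment with a text prompt, let a critic inspect the mask, and re-prompt until the critic accepts or a round budget runs out. The Python sketch below is only an illustration of that pattern under stated assumptions. The names `refine_loop`, `Feedback`, `toy_segment`, and `toy_critique` are hypothetical and are not taken from the MedSAM-3 code base; the toy segmenter (a threshold) and toy critic (an area check) merely stand in for the segmentation model and the MLLM.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Image = List[List[int]]  # toy grayscale image as nested lists
Mask = List[List[int]]   # toy binary mask, same shape as the image


@dataclass
class Feedback:
    """Critic verdict: accept the mask, or propose a revised text prompt."""
    accept: bool
    revised_prompt: str


def refine_loop(
    image: Image,
    prompt: str,
    segment: Callable[[Image, str], Mask],
    critique: Callable[[Image, Mask, str], Feedback],
    max_rounds: int = 3,
) -> Tuple[Mask, List[str]]:
    """Segment, critique, and re-prompt until accepted or the budget is spent."""
    history = [prompt]
    mask = segment(image, prompt)
    for _ in range(max_rounds):
        feedback = critique(image, mask, prompt)
        if feedback.accept:
            break
        prompt = feedback.revised_prompt
        history.append(prompt)
        mask = segment(image, prompt)
    return mask, history


# --- Hypothetical stand-ins, for illustration only ---

def toy_segment(image: Image, prompt: str) -> Mask:
    """Placeholder for a text-promptable segmenter: thresholds pixel values."""
    threshold = 5 if "bright" in prompt else 1
    return [[1 if px >= threshold else 0 for px in row] for row in image]


def toy_critique(image: Image, mask: Mask, prompt: str) -> Feedback:
    """Placeholder for an MLLM critic: rejects masks covering over half the image."""
    area = sum(sum(row) for row in mask)
    total = sum(len(row) for row in mask)
    if area > total // 2:
        return Feedback(accept=False, revised_prompt="bright " + prompt)
    return Feedback(accept=True, revised_prompt=prompt)


if __name__ == "__main__":
    image = [[0, 2, 9], [1, 9, 9], [0, 3, 2]]
    mask, history = refine_loop(image, "lesion", toy_segment, toy_critique)
    print(history)
    print(mask)
```

Passing the segmenter and critic in as callables keeps the loop independent of any particular model, so a real segmentation model and MLLM could be substituted without changing the control flow.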