
SAM-Med2D

Junlong Cheng, Jin Ye, Zhongying Deng, Jianpin Chen, Tianbin Li, Haoyu Wang, Yanzhou Su, Ziyan Huang, Jilong Chen, Lei Jiang, Hui Sun, Junjun He, Shaoting Zhang, Min Zhu, Yu Qiao

TL;DR

This work addresses the domain gap between SAM, which is trained on natural images, and medical image segmentation. The authors assemble a large-scale medical segmentation dataset (approximately 4.6M images and 19.7M masks) and fine-tune SAM with encoder adapters and an expanded prompt set (points, boxes, and masks). SAM-Med2D shows significant Dice gains over SAM and FT-SAM across diverse modalities and anatomical structures, generalizes well to 9 MICCAI 2023 datasets, and remains effective with single-point interactions. The approach enables robust, interactive medical image segmentation and provides a scalable foundation for domain-specific adaptation. The authors also discuss limitations and future directions, including expanding prompts and data, and plan to release code and models.
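Since the reported gains are stated in terms of Dice, a minimal sketch of the Dice coefficient for two binary masks may help. The function name `dice_score` and the convention of returning 1.0 when both masks are empty are assumptions made for illustration; the paper's own evaluation code is not shown in this summary.

```python
def dice_score(pred, target):
    """Dice = 2 * |P intersect T| / (|P| + |T|) for binary masks given as nested lists of 0/1."""
    intersection = 0
    pred_area = 0
    target_area = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            intersection += 1 if (p and t) else 0
            pred_area += 1 if p else 0
            target_area += 1 if t else 0
    if pred_area + target_area == 0:
        # Assumed convention: two empty masks count as a perfect match.
        return 1.0
    return 2.0 * intersection / (pred_area + target_area)


predicted = [[1, 1], [0, 0]]
reference = [[1, 0], [0, 0]]
score = dice_score(predicted, reference)  # one shared pixel, areas 2 and 1
```

A score of 1.0 means the two masks coincide exactly, and 0.0 means they share no foreground pixel.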

Abstract

The Segment Anything Model (SAM) represents a state-of-the-art research advancement in natural image segmentation, achieving impressive results with input prompts such as points and bounding boxes. However, our evaluation and recent research indicate that directly applying the pretrained SAM to medical image segmentation does not yield satisfactory performance. This limitation primarily arises from the significant domain gap between natural images and medical images. To bridge this gap, we introduce SAM-Med2D, the most comprehensive study to date on applying SAM to 2D medical images. Specifically, we first collect and curate approximately 4.6M images and 19.7M masks from public and private datasets, constructing a large-scale medical image segmentation dataset encompassing various modalities and objects. Then, we comprehensively fine-tune SAM on this dataset to obtain SAM-Med2D. Unlike previous methods that adopt only bounding box or point prompts for interactive segmentation, we adapt SAM to medical image segmentation through more comprehensive prompts involving bounding boxes, points, and masks. We additionally fine-tune the encoder and decoder of the original SAM to obtain a well-performing SAM-Med2D, resulting in the most comprehensive fine-tuning strategy to date. Finally, we conduct a comprehensive evaluation and analysis of the performance of SAM-Med2D in medical image segmentation across various modalities, anatomical structures, and organs. We also validate the generalization capability of SAM-Med2D on 9 datasets from the MICCAI 2023 challenge. Overall, our approach demonstrates significantly superior performance and generalization capability compared to SAM.
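To make the point and bounding-box prompt types concrete, the sketch below derives both from a ground-truth binary mask, which is one common way to simulate user prompts during training or evaluation. This is an illustrative assumption, not the authors' procedure: the abstract does not describe how prompts are sampled, and the helper names `bbox_from_mask` and `sample_point_prompt` are hypothetical.

```python
import random


def bbox_from_mask(mask):
    """Return the tight box (x_min, y_min, x_max, y_max) around foreground pixels, or None if empty."""
    ys = [y for y, row in enumerate(mask) if any(row)]
    xs = [x for row in mask for x, value in enumerate(row) if value]
    if not ys:
        return None
    return (min(xs), min(ys), max(xs), max(ys))


def sample_point_prompt(mask, rng):
    """Pick one foreground pixel (x, y) uniformly at random, or None if the mask is empty."""
    foreground = [(x, y) for y, row in enumerate(mask) for x, value in enumerate(row) if value]
    if not foreground:
        return None
    return rng.choice(foreground)


mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 0, 0],
]
box_prompt = bbox_from_mask(mask)
point_prompt = sample_point_prompt(mask, random.Random(0))
```

A mask prompt, the third type mentioned above, would typically be a coarse mask supplied directly (for example, a previous prediction), so it needs no derivation step here.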

Paper Structure (11 sections, 7 figures, 4 tables)

Figures (7)

  • Figure 1: Comparison between examples in SA-1B (a) and in our dataset (b). SA-1B consists of 11M natural images and their corresponding 1129M masks. Our dataset consists of 4.6M medical images and their corresponding 19.7M masks.
  • Figure 2: Results of interactive segmentation using SAM in various medical scenarios.
  • Figure 3: Overview of the dataset used in this study. (a) A total of 31 major organs, along with their corresponding anatomical structures; an asterisk (*) denotes the presence of lesion labels within the dataset. (b) The distribution of modalities and their corresponding proportions in the dataset (scaled logarithmically). (c) The number of images and masks categorized by anatomical structure, along with the total count for the whole dataset.
  • Figure 4: The pipeline of SAM-Med2D. We freeze the image encoder and incorporate learnable adapter layers in each Transformer block to acquire domain-specific knowledge in the medical field. We fine-tune the prompt encoder using point, bounding-box, and mask information, while updating the parameters of the mask decoder through interactive training.
  • Figure 5: (a) Comparison from the perspective of anatomical structures. (b) Comparison from the perspective of different modalities. (c) Comparison of segmentation performance between FT-SAM and our SAM-Med2D across 31 organs.
  • ...and 2 more figures
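
The Figure 4 caption describes freezing the image encoder and adding learnable adapter layers to each Transformer block. The sketch below illustrates that general idea with a generic residual bottleneck adapter in plain Python. It is an assumption-laden toy, not the SAM-Med2D adapter design: `BottleneckAdapter`, `frozen_block`, the ReLU nonlinearity, and the zero-initialised up-projection are all illustrative choices that the caption does not specify.

```python
import random


def matvec(weights, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


class BottleneckAdapter:
    """Residual adapter: x + up(relu(down(x))). Only these weights would be trained."""

    def __init__(self, dim, bottleneck, rng):
        self.down = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(bottleneck)]
        # Assumed choice: zero-initialised up-projection, so the adapter starts as the identity.
        self.up = [[0.0] * bottleneck for _ in range(dim)]

    def num_trainable(self):
        return sum(len(row) for row in self.down) + sum(len(row) for row in self.up)

    def __call__(self, x):
        hidden = [max(0.0, h) for h in matvec(self.down, x)]
        delta = matvec(self.up, hidden)
        return [xi + di for xi, di in zip(x, delta)]


def frozen_block(x):
    """Hypothetical stand-in for a pretrained Transformer block whose weights never change."""
    return [2.0 * v for v in x]


adapter = BottleneckAdapter(dim=4, bottleneck=2, rng=random.Random(0))
token = [1.0, -2.0, 0.5, 3.0]
output = adapter(frozen_block(token))
```

With the up-projection at zero, the adapted block initially reproduces the frozen block's output exactly, so training starts from the pretrained behaviour and only the small adapter weights move away from it.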