Segment Anything Model for Medical Images?

Yuhao Huang, Xin Yang, Lian Liu, Han Zhou, Ao Chang, Xinrui Zhou, Rusi Chen, Junxuan Yu, Jiongquan Chen, Chaoyu Chen, Sijing Liu, Haozhe Chi, Xindi Hu, Kejuan Yue, Lei Li, Vicente Grau, Deng-Ping Fan, Fajin Dong, Dong Ni

TL;DR

This work evaluates the Segment Anything Model (SAM) for medical image segmentation (MIS) on COSMOS 1050K, a large multi-modality dataset assembled from 53 open-source datasets to stress-test cross-domain MIS. It analyzes SAM under the automatic Everything mode and under manual prompting (points and boxes) with the ViT-B and ViT-H backbones, and reports that box prompts and the larger ViT-H backbone generally yield better, more stable results, while Everything mode remains weaker. The study also introduces a mask-matching evaluation, assesses inference efficiency via embedding caching, and shows that SAM can speed up human annotation and improve labeling quality, although it remains sensitive to prompt randomness and boundary complexity. Task-specific finetuning (MedSAM) raises the average Dice score by 4.39% for ViT-B and 6.68% for ViT-H, which highlights the potential of adapting foundation segmentation models to MIS; the discussion points to future directions such as semantic awareness and 3D-consistent modeling. Overall, SAM shows promise as a general MIS tool but needs careful prompting, an appropriate backbone, and domain-specific refinement before it can be deployed reliably.
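The box prompts and the prompt randomness mentioned above can be made concrete with a small sketch. It is not the authors' code: the function names `tight_box` and `jitter_box`, the nested-list mask format, and the uniform integer jitter are all assumptions chosen for illustration. It derives a tight bounding box from a binary ground-truth mask and then perturbs the box edges at random, which is one plausible way to probe sensitivity to imprecise box prompts.

```python
import random


def tight_box(mask):
    """Return (x_min, y_min, x_max, y_max) enclosing all foreground pixels.

    `mask` is a list of rows containing 0/1 values (hypothetical format).
    """
    coords = [(x, y) for y, row in enumerate(mask) for x, v in enumerate(row) if v]
    if not coords:
        raise ValueError("mask has no foreground pixels")
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    return min(xs), min(ys), max(xs), max(ys)


def jitter_box(box, max_shift, width, height, rng):
    """Shift each box edge by a random integer in [-max_shift, max_shift].

    Coordinates are clipped to the image and re-sorted so the box stays valid.
    The uniform jitter is an assumption, not the perturbation used in the paper.
    """
    def move(value, upper):
        return min(max(value + rng.randint(-max_shift, max_shift), 0), upper)

    x0, y0, x1, y1 = box
    x0, x1 = sorted((move(x0, width - 1), move(x1, width - 1)))
    y0, y1 = sorted((move(y0, height - 1), move(y1, height - 1)))
    return x0, y0, x1, y1


mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]
box = tight_box(mask)  # (1, 1, 2, 2)
noisy_box = jitter_box(box, max_shift=1, width=4, height=4, rng=random.Random(0))
```

Either box could then be passed to a promptable segmenter as the box prompt; comparing the resulting masks gives a rough picture of how much the output depends on prompt precision.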

Abstract

The Segment Anything Model (SAM) is the first foundation model for general image segmentation. It has achieved impressive results on various natural image segmentation tasks. However, medical image segmentation (MIS) is more challenging because of complex modalities, fine anatomical structures, uncertain and complex object boundaries, and wide-ranging object scales. To fully validate SAM's performance on medical data, we collected and sorted 53 open-source datasets and built a large medical segmentation dataset with 18 modalities, 84 objects, 125 object-modality paired targets, 1050K 2D images, and 6033K masks. We comprehensively analyzed different models and strategies on this dataset, named COSMOS 1050K. Our main findings are as follows: 1) SAM showed remarkable performance on some specific objects but was unstable, imperfect, or even failed completely in other situations. 2) SAM with the large ViT-H showed better overall performance than with the small ViT-B. 3) SAM performed better with manual hints, especially boxes, than in Everything mode. 4) SAM could help human annotation, yielding high labeling quality in less time. 5) SAM was sensitive to randomness in the center point and tight box prompts, and could suffer a serious performance drop. 6) SAM performed better than interactive methods with one or a few points, but was outpaced as the number of points increased. 7) SAM's performance correlated with several factors, including boundary complexity and intensity differences, among others. 8) Finetuning SAM on specific medical tasks could improve its average Dice performance by 4.39% and 6.68% for ViT-B and ViT-H, respectively. We hope that this comprehensive report can help researchers explore the potential of SAM applications in MIS and guide the appropriate use and development of SAM.
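Since the findings above are reported as Dice scores, a minimal sketch of the standard Dice coefficient for two binary masks may help, Dice = 2|A ∩ B| / (|A| + |B|). The function name `dice_score`, the nested-list mask format, and the convention of returning 1.0 when both masks are empty are assumptions for illustration; the paper's own evaluation code may differ in these details.

```python
def dice_score(pred, target):
    """Dice coefficient between two binary masks given as lists of rows.

    Returns 2 * |pred AND target| / (|pred| + |target|).
    Two empty masks are treated as a perfect match (assumed convention).
    """
    pred_flat = [bool(v) for row in pred for v in row]
    target_flat = [bool(v) for row in target for v in row]
    if len(pred_flat) != len(target_flat):
        raise ValueError("masks must contain the same number of pixels")

    intersection = sum(p and t for p, t in zip(pred_flat, target_flat))
    total = sum(pred_flat) + sum(target_flat)
    if total == 0:
        return 1.0
    return 2.0 * intersection / total


pred = [[1, 1], [0, 0]]
target = [[1, 0], [0, 0]]
score = dice_score(pred, target)  # 2 * 1 / (2 + 1), about 0.667
```

A per-object average of this score over a test set is the kind of quantity that a reported improvement in average Dice refers to.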

Paper Structure (29 sections, 2 equations, 18 figures, 8 tables)

This paper contains 29 sections, 2 equations, 18 figures, 8 tables.

Figures (18)

  • Figure 1: Our COSMOS 1050K dataset contains various modalities involving (a) CT, (b) MRI, (c) T1-weighted (T1W) MRI, (d) T2-weighted (T2W) MRI, (e) ADC MRI, (f) Cine-MRI, (g) CMR, (h) diffusion-weighted (DW) MRI, (i) post-contrast T1-weighted (T1-GD) MRI, (j) T2 Fluid Attenuated Inversion Recovery (T2-FLAIR) MRI, (k) Histopathology, (l) Electron Microscopy, (m) Ultrasound (US), (n) X-ray, (o) Fundus, (p) Colonoscopy, (q) Dermoscopy, and (r) Microscopy.
  • Figure 2: Our COSMOS 1050K dataset covers the majority of biomedical objects, for example, brain tumors, fundus vasculature, thyroid nodules, spine, lung, heart, abdominal organs and tumors, cell, polyp, and instrument.
  • Figure 3: Statistics of the COSMOS 1050K dataset. (a) Number of datasets after preprocessing. (b) Histogram of the quantities of the 84 objects, identified by the abbreviations mapped in the legend. (c) Number of modalities. (d) Histogram of image resolutions. In (d), each bar represents an image-area interval, e.g., $128 \times 128$ represents the interval (0, $128 \times 128$) and $256 \times 256$ represents the interval ($128 \times 128$, $256 \times 256$).
  • Figure 4: Typical examples meeting the exclusion criteria. (a) cochlea (criterion 1), (b) intestine (criterion 2), (c) histopathological breast cancer (criterion 3), and (d) lung trachea trees (criterion 3). The corners of (b) and (d) show 3D renderings obtained with the Pair annotation software package (liang2022sketch).
  • Figure 5: Testing pipeline of SAM in our study.
  • ...and 13 more figures