Segment Anything Model for Medical Images?
Yuhao Huang, Xin Yang, Lian Liu, Han Zhou, Ao Chang, Xinrui Zhou, Rusi Chen, Junxuan Yu, Jiongquan Chen, Chaoyu Chen, Sijing Liu, Haozhe Chi, Xindi Hu, Kejuan Yue, Lei Li, Vicente Grau, Deng-Ping Fan, Fajin Dong, Dong Ni
TL;DR
This work evaluates the Segment Anything Model (SAM) for medical image segmentation (MIS) on COSMOS 1050K, a large multi-modality dataset built to stress-test cross-domain MIS. It compares SAM's automatic Everything mode with manual prompting modes (points and boxes) on the ViT-B and ViT-H backbones, and reports that box prompts and the larger ViT-H backbone generally give better, more stable results, while Everything mode remains weaker. The study also introduces a mask-matching evaluation, assesses inference efficiency via embedding caching, and shows that SAM can speed up human annotation while keeping labeling quality high, although performance remains sensitive to prompt randomness and boundary complexity. Task-specific finetuning (MedSAM) raises the average Dice score by 4.39% for ViT-B and 6.68% for ViT-H, which highlights the potential of adapting foundation segmentation models to MIS; the discussion points to future directions such as semantic awareness and 3D-consistent modeling. Overall, SAM shows promise as a general MIS tool, but reliable deployment requires careful prompting, an appropriate backbone, and domain-specific refinement.
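To make the point and box prompting modes concrete, the following is a minimal sketch of how such prompts might be derived from a ground-truth mask, and how a box could be randomly perturbed to probe prompt sensitivity. The function names (`tight_box`, `center_point`, `jitter_box`), the "foreground pixel nearest the centroid" definition of the center point, and the per-edge integer shift are all illustrative assumptions, not the paper's actual procedure. Plain nested lists are used to keep the sketch dependency-free.

```python
import random


def _foreground(mask):
    """Collect (x, y) coordinates of all non-zero pixels in a 2D list mask."""
    coords = [(x, y) for y, row in enumerate(mask) for x, v in enumerate(row) if v]
    if not coords:
        raise ValueError("mask has no foreground pixels")
    return coords


def tight_box(mask):
    """Return (x_min, y_min, x_max, y_max) enclosing every foreground pixel."""
    coords = _foreground(mask)
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    return min(xs), min(ys), max(xs), max(ys)


def center_point(mask):
    """Return the foreground pixel closest to the foreground centroid.

    Assumption: snapping to a foreground pixel keeps the prompt inside
    the object even when the shape is concave.
    """
    coords = _foreground(mask)
    cx = sum(x for x, _ in coords) / len(coords)
    cy = sum(y for _, y in coords) / len(coords)
    return min(coords, key=lambda p: (p[0] - cx) ** 2 + (p[1] - cy) ** 2)


def jitter_box(box, max_shift, width, height, rng=random):
    """Shift each box edge by a random integer in [-max_shift, max_shift].

    Hypothetical perturbation scheme; results are clamped to the image
    and re-ordered so the box stays valid.
    """
    def shift(value, upper):
        return min(max(value + rng.randint(-max_shift, max_shift), 0), upper)

    x0, y0, x1, y1 = box
    nx0, nx1 = shift(x0, width - 1), shift(x1, width - 1)
    ny0, ny1 = shift(y0, height - 1), shift(y1, height - 1)
    return min(nx0, nx1), min(ny0, ny1), max(nx0, nx1), max(ny0, ny1)


example_mask = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
print(tight_box(example_mask))     # (1, 1, 3, 2)
print(center_point(example_mask))  # (2, 2)
```

Feeding both the exact and the jittered prompts to the same model and comparing the resulting masks is one simple way to quantify the prompt sensitivity mentioned above.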
Abstract
The Segment Anything Model (SAM) is the first foundation model for general image segmentation. It has achieved impressive results on various natural image segmentation tasks. However, medical image segmentation (MIS) is more challenging because of complex modalities, fine anatomical structures, uncertain and complex object boundaries, and a wide range of object scales. To fully validate SAM's performance on medical data, we collected and sorted 53 open-source datasets and built a large medical segmentation dataset with 18 modalities, 84 objects, 125 object-modality paired targets, 1050K 2D images, and 6033K masks. We comprehensively analyzed different models and strategies on this dataset, which we call COSMOS 1050K. Our main findings are as follows: 1) SAM showed remarkable performance on some specific objects but was unstable, imperfect, or even failed completely in other situations. 2) SAM with the large ViT-H showed better overall performance than SAM with the small ViT-B. 3) SAM performed better with manual hints, especially boxes, than in Everything mode. 4) SAM could help human annotation, yielding high labeling quality in less time. 5) SAM was sensitive to randomness in the center point and tight box prompts, and could suffer a serious performance drop. 6) SAM performed better than interactive methods with one or a few points, but was outpaced as the number of points increased. 7) SAM's performance correlated with several factors, including boundary complexity and intensity differences. 8) Finetuning SAM on specific medical tasks could improve its average Dice performance by 4.39% and 6.68% for ViT-B and ViT-H, respectively. We hope that this comprehensive report can help researchers explore the potential of SAM applications in MIS and guide the appropriate use and development of SAM.
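Since the headline numbers above are average Dice scores, a short sketch of the metric may help. It uses the standard Dice coefficient, 2|A∩B| / (|A| + |B|), on binary masks stored as nested lists. The helper names `dice` and `mean_dice`, and the convention of returning 1.0 when both masks are empty, are assumptions made for illustration and are not taken from the paper's evaluation code.

```python
def dice(pred, target):
    """Dice coefficient between two binary masks of equal shape.

    Masks are 2D lists of 0/1 values. Returns a float in [0, 1].
    """
    flat_pred = [bool(v) for row in pred for v in row]
    flat_target = [bool(v) for row in target for v in row]
    if len(flat_pred) != len(flat_target):
        raise ValueError("masks must have the same number of pixels")

    intersection = sum(p and t for p, t in zip(flat_pred, flat_target))
    total = sum(flat_pred) + sum(flat_target)
    if total == 0:
        # Assumed convention: two empty masks count as a perfect match.
        return 1.0
    return 2.0 * intersection / total


def mean_dice(pairs):
    """Average Dice over an iterable of (pred, target) mask pairs."""
    scores = [dice(pred, target) for pred, target in pairs]
    if not scores:
        raise ValueError("no mask pairs given")
    return sum(scores) / len(scores)


pred = [[1, 1, 0],
        [0, 1, 0]]
target = [[1, 0, 0],
          [0, 1, 1]]
# 2 overlapping pixels, 3 predicted, 3 in the target: 2 * 2 / (3 + 3)
print(round(dice(pred, target), 4))  # 0.6667
```

Under this definition, a reported gain in average Dice corresponds to the difference between `mean_dice` values computed for the finetuned and the original model over the same set of cases.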
