Segment Anything Model for Medical Image Analysis: an Experimental Study
Maciej A. Mazurowski, Haoyu Dong, Hanxue Gu, Jichen Yang, Nicholas Konz, Yixin Zhang
TL;DR
This study evaluates the Segment Anything Model (SAM) on diverse medical imaging datasets to quantify its zero-shot segmentation capabilities and how prompting strategies affect performance. It systematically tests five non-iterative prompting modes, iterative prompting, and the segment-everything option across 28 tasks, comparing against leading interactive segmentation methods. Key findings show box prompts, especially when targeting separate object parts, yield the best average performance, while iterative prompts offer limited improvements for SAM; performance varies widely by dataset and object characteristics. The work provides practical guidance on prompting strategies for medical image segmentation and outlines avenues for adapting SAM to medical contexts and 3D data in future work.
Abstract
Training segmentation models for medical images remains challenging because data annotations are scarce. The Segment Anything Model (SAM) is a foundation model intended to segment user-defined objects of interest interactively. While its performance on natural images is impressive, medical image domains pose their own set of challenges. Here, we perform an extensive evaluation of SAM's ability to segment medical images on a collection of 19 medical imaging datasets from various modalities and anatomies. We report the following findings: (1) SAM's performance with a single prompt varies widely depending on the dataset and the task, from IoU=0.1135 for spine MRI to IoU=0.8650 for hip X-ray. (2) Segmentation performance appears to be better for well-circumscribed objects with less ambiguous prompts and poorer in other scenarios, such as the segmentation of brain tumors. (3) SAM performs notably better with box prompts than with point prompts. (4) SAM outperforms the similar methods RITM, SimpleClick, and FocalClick in almost all single-point prompt settings. (5) When multiple point prompts are provided iteratively, SAM's performance generally improves only slightly, while the other methods improve to a level that surpasses SAM's point-based performance. We also provide several illustrations of SAM's performance on all tested datasets, of iterative segmentation, and of SAM's behavior under prompt ambiguity. We conclude that SAM shows impressive zero-shot segmentation performance for certain medical imaging datasets, but moderate to poor performance for others. SAM has the potential to make a significant impact on automated medical image segmentation, but appropriate care needs to be applied when using it.
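
For reference, the IoU (intersection over union) values quoted above measure the overlap between a predicted mask and the ground-truth mask, divided by the area covered by either. The following is a minimal sketch, assuming binary masks stored as NumPy arrays. The helper name `iou` and the convention of scoring two empty masks as 1.0 are illustrative assumptions, not taken from the study's code.

```python
import numpy as np


def iou(pred, target):
    """Intersection over union of two binary masks of the same shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    union = np.logical_or(pred, target).sum()
    if union == 0:
        # Assumption: two empty masks are treated as perfect agreement.
        return 1.0
    intersection = np.logical_and(pred, target).sum()
    return float(intersection / union)


# Toy example: two 4x4 squares on an 8x8 grid, shifted by one pixel.
target = np.zeros((8, 8), dtype=bool)
target[2:6, 2:6] = True          # 16 ground-truth pixels
pred = np.zeros((8, 8), dtype=bool)
pred[3:7, 3:7] = True            # 16 predicted pixels, 9 of them overlapping

score = iou(pred, target)        # 9 / (16 + 16 - 9) = 9 / 23
print(f"IoU = {score:.4f}")
```

On this scale, a score near 0.1 (as reported for spine MRI) means the prediction and the annotation barely overlap, whereas a score near 0.87 (hip X-ray) means they agree on most of the object.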
