Testing the Segment Anything Model on radiology data

José Guilherme de Almeida; Nuno M. Rodrigues; Sara Silva; Nickolas Papanikolaou

TL;DR

This study critically evaluates SAM, a pioneering foundation segmentation model, on MRI data to probe its zero-shot capabilities in radiology. By applying standard and seeded inference modes and four segment-selection heuristics across MRI datasets from the Medical Segmentation Decathlon and ProstateX, the authors quantify performance with Dice and related metrics. They find SAM generally underperforms for radiology segmentation, with a best Dice of 65% at IoU > 0.25 for the left atrium, and substantially poorer results for other tasks; seeds do not improve results. The work highlights the domain gap between natural-image-trained foundation models and radiology data, suggesting careful use of SAM in clinical settings and pointing to improvements from domain-specific adaptations and finetuning.
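
For readers less familiar with the metrics named above, the following is a minimal sketch of Dice and intersection over union (IoU) for binary masks, using NumPy. The function `select_by_iou` is hypothetical: it illustrates one possible reading of "at IoU > 0.25", namely keeping candidate masks whose IoU with the ground truth exceeds a threshold. It is not taken from the paper, whose four selection heuristics are not described in this summary.

```python
import numpy as np


def iou(pred: np.ndarray, truth: np.ndarray) -> float:
    """Intersection over union of two binary masks (1.0 if both are empty)."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    union = np.logical_or(pred, truth).sum()
    if union == 0:
        return 1.0
    return float(np.logical_and(pred, truth).sum() / union)


def dice(pred: np.ndarray, truth: np.ndarray) -> float:
    """Dice coefficient of two binary masks (1.0 if both are empty)."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    total = pred.sum() + truth.sum()
    if total == 0:
        return 1.0
    return float(2 * np.logical_and(pred, truth).sum() / total)


def select_by_iou(candidates, truth: np.ndarray, threshold: float = 0.25) -> np.ndarray:
    """Hypothetical heuristic: union of candidates whose IoU with truth exceeds threshold."""
    selected = np.zeros(truth.shape, dtype=bool)
    for mask in candidates:
        if iou(mask, truth) > threshold:
            selected |= mask.astype(bool)
    return selected


# Toy example: a 2x2 ground-truth square and two candidate masks.
truth = np.zeros((4, 4), dtype=bool)
truth[1:3, 1:3] = True
good = np.zeros((4, 4), dtype=bool)
good[1:3, 1:4] = True  # covers the object plus two extra pixels
bad = np.zeros((4, 4), dtype=bool)
bad[0, 0] = True  # does not touch the object

merged = select_by_iou([good, bad], truth)
print(iou(good, truth), dice(merged, truth))
```

Because this kind of selection uses the ground truth, it measures the best segmentation available in the pool of predictions rather than what a fully automatic pipeline would achieve.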

Abstract

Deep learning models trained with large amounts of data have become a recent and effective approach to predictive problem solving -- these have become known as "foundation models" as they can be used as fundamental tools for other applications. While the paramount examples of image classification (earlier) and large language models (more recently) led the way, the Segment Anything Model (SAM) was recently proposed and stands as the first foundation model for image segmentation, trained on over 10 million images and with recourse to over 1 billion masks. However, the question remains -- what are the limits of this foundation? Given that magnetic resonance imaging (MRI) stands as an important method of diagnosis, we sought to understand whether SAM could be used for a few tasks of zero-shot segmentation using MRI data. Particularly, we wanted to know if selecting masks from the pool of SAM predictions could lead to good segmentations. Here, we provide a critical assessment of the performance of SAM on magnetic resonance imaging data. We show that, while acceptable in a very limited set of cases, the overall trend implies that these models are insufficient for MRI segmentation across the whole volume, but can provide good segmentations in a few, specific slices. More importantly, we note that while foundation models trained on natural images are set to become key aspects of predictive modelling, they may prove ineffective when used on other imaging modalities.
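
The distinction between performance across the whole volume and performance in specific slices can be made concrete with a short sketch. It assumes that a "detected" slice is one where the prediction contains at least one foreground pixel; the name `volume_dice` and that definition are assumptions made for illustration, not the authors' code.

```python
import numpy as np


def volume_dice(pred: np.ndarray, truth: np.ndarray) -> dict:
    """Compare Dice over all slices with Dice over detected slices only.

    pred, truth: binary arrays of shape (slices, height, width).
    """
    pred, truth = pred.astype(bool), truth.astype(bool)
    # Assumption: a slice counts as detected if any pixel was predicted in it.
    detected = pred.any(axis=(1, 2))

    def _dice(p: np.ndarray, t: np.ndarray) -> float:
        total = p.sum() + t.sum()
        if total == 0:
            return float("nan")  # undefined when both masks are empty
        return float(2 * np.logical_and(p, t).sum() / total)

    return {
        "dice_all_slices": _dice(pred, truth),
        "dice_detected_slices": _dice(pred[detected], truth[detected]),
        "slices_without_detection": int((~detected).sum()),
    }


# Toy volume: the object fills all three slices, but only the first is predicted.
truth = np.ones((3, 2, 2), dtype=bool)
pred = np.zeros((3, 2, 2), dtype=bool)
pred[0] = True

scores = volume_dice(pred, truth)
print(scores)
```

In this toy case the single predicted slice is perfect, yet the volume-level score is halved by the two missed slices, which mirrors the pattern the abstract describes.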

Paper Structure

This paper contains 18 sections, 6 figures, and 2 tables.

Introduction
Context
Foundation models
Zero-shot segmentation
SAM in the medical domain
Annotation of 3D radiology data
Methods
Data
SAM inference
Selection of segments of interest
Performance evaluation
Code availability
Results
SAM underperforms on radiology data
SAM --- potentially useful in limited cases?
...and 3 more sections

Figures (6)

Figure 1: Intersection over union for the Segment Anything Model (SAM) when considering all slices in a volume (left) and when considering only slices with predictions (right). Each vertical facet corresponds to an anatomical zone and colours represent the source of the data (between ProstateX and the Medical Segmentation Decathlon datasets).
Figure 2: Dice scores for all predictions and their association with the number of slices with no segment detections. Each panel corresponds to an anatomical zone for a given dataset and colours correspond to Dice scores calculated using all slices (green) or only slices with detections (purple). The trend line was fitted with locally estimated scatterplot smoothing (otherwise known as "loess"). Each point represents one volume.
Figure 3: Dice scores for all predictions and their association with the fraction of the object which was detected (dividing the number of pixels in the slices with the detected object by the total object size). The trend line was fitted with locally estimated scatterplot smoothing (otherwise known as "loess"). Each point represents one volume (here, all 660 volumes are plotted).
Figure 4: Visualization of the $3 \times 3$ grid of seeds (point prompts) used as input to the SAM model using the ProstateX dataset. Points represent seeds and the white outline represents the contour of the prostate segmentation ground truth.
Figure 5: Comparison of the Dice scores obtained using standard SAM and seeded SAM. Each point represents a study in ProstateX (n=181) and colours correspond to different segment selection heuristics.
...and 1 more figure
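
The $3 \times 3$ grid of seeds described for Figure 4 can be illustrated with a small sketch. The exact seed positions used in the paper are not given in this summary, so the even interior spacing below is an assumption, and `seed_grid` is a hypothetical helper rather than the authors' implementation.

```python
def seed_grid(height: int, width: int, n: int = 3) -> list[tuple[float, float]]:
    """Return n * n (x, y) points evenly spaced inside a height x width image.

    Assumption: points sit at fractions 1/(n+1), ..., n/(n+1) of each axis,
    so none of them falls on the image border.
    """
    xs = [width * (i + 1) / (n + 1) for i in range(n)]
    ys = [height * (j + 1) / (n + 1) for j in range(n)]
    # Row-major order: all x positions for the first row, then the next row.
    return [(x, y) for y in ys for x in xs]


# Nine point prompts for a 256 x 256 slice.
points = seed_grid(256, 256)
print(points)
```

Each point could then be supplied to the model as a foreground point prompt, one seed at a time or jointly, depending on how the seeded inference mode is set up.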
