How to build the best medical image segmentation algorithm using foundation models: a comprehensive empirical study with Segment Anything Model

Hanxue Gu; Haoyu Dong; Jichen Yang; Maciej A. Mazurowski

TL;DR

This study assesses how to adapt the Segment Anything Model for medical image segmentation under varied data availability: single-task, multi-task, few-shot, and interactive scenarios. It systematically compares encoder backbones, update scopes, and parameter-efficient fine-tuning methods across 18 configurations on 17 radiology datasets, and investigates task-expansive prompting and self-supervised pretraining. Key findings show that parameter-efficient fine-tuning on both encoder and decoder yields strong performance gains, larger backbones may not substantially improve results due to data constraints, and self-supervised pretraining can boost MRI segmentation but prompt-based pretraining offers limited automation benefits. The work provides practical guidance and releases MRI-specific fine-tuned weights and code to advance SAM-based medical image segmentation practice.
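To illustrate what parameter-efficient fine-tuning means here, the following is a minimal NumPy sketch of a LoRA-style update on a single linear layer. It is not the authors' implementation: the layer sizes, rank, scaling factor, and the function name `lora_forward` are assumptions chosen for the example. The pretrained weight `W` stays frozen, and only the two small matrices `A` and `B` would be trained.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only (assumptions, not values from the paper).
d_in, d_out, rank, alpha = 64, 64, 4, 8.0

W = rng.normal(size=(d_out, d_in))              # frozen pretrained weight
A = rng.normal(scale=0.01, size=(rank, d_in))   # trainable down-projection
B = np.zeros((d_out, rank))                     # trainable up-projection, zero-initialized


def lora_forward(x, W, A, B, alpha, rank):
    """Compute y = x W^T + (alpha / rank) * x A^T B^T (hypothetical helper)."""
    return x @ W.T + (alpha / rank) * (x @ A.T) @ B.T


x = rng.normal(size=(2, d_in))
y = lora_forward(x, W, A, B, alpha, rank)

# With B initialized to zero, the adapted layer starts out identical to the frozen one.
starts_identical = np.allclose(y, x @ W.T)

# Only A and B are updated during fine-tuning.
trainable_params = A.size + B.size
full_params = W.size
```

In this toy layer the low-rank matrices hold 512 parameters against 4096 in the frozen weight, which is the kind of saving that makes it feasible to adapt both the encoder and the decoder.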

Abstract

Automated segmentation is a fundamental medical image analysis task that has advanced significantly with the advent of deep learning. Foundation models have been useful in natural language processing and some vision tasks for some time, but the Segment Anything Model (SAM), a foundation model designed specifically for image segmentation, was developed only recently and has shown similar promise. However, there are still no systematic analyses or "best-practice" guidelines for optimal fine-tuning of SAM for medical image segmentation. This work summarizes existing fine-tuning strategies with various backbone architectures, model components, and fine-tuning algorithms across 18 combinations, and evaluates them on 17 datasets covering all common radiology modalities. Our study reveals that (1) fine-tuning SAM leads to slightly better performance than previous segmentation methods, (2) fine-tuning strategies that use parameter-efficient learning in both the encoder and decoder are superior to other strategies, (3) network architecture has a small impact on final performance, and (4) further training SAM with self-supervised learning can improve final model performance. We also demonstrate the ineffectiveness of some methods popular in the literature and extend our experiments to few-shot and prompt-based settings. Lastly, we release our code and MRI-specific fine-tuned weights, which consistently obtained superior performance over the original SAM, at https://github.com/mazurowski-lab/finetune-SAM.
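The 18 combinations can be enumerated as a small grid. The sketch below uses the labels given in the figure captions of this paper (ViT-T, ViT-B, ViT-H; De and En/De; V, A, L); the dictionary keys and variable names are assumptions introduced for illustration.

```python
from itertools import product

# Labels follow the figure captions; key names are hypothetical.
encoders = ["ViT-T", "ViT-B", "ViT-H"]   # image encoder architecture
update_scopes = ["De", "En/De"]          # mask decoder only, or image encoder + mask decoder
methods = ["V", "A", "L"]                # vanilla (all parameters), PEFT-Adapter, PEFT-LoRA

configs = [
    {"encoder": encoder, "scope": scope, "method": method}
    for encoder, scope, method in product(encoders, update_scopes, methods)
]

# 3 encoders x 2 update scopes x 3 methods
num_configs = len(configs)

for cfg in configs:
    print(f'{cfg["encoder"]} + {cfg["scope"]} + {cfg["method"]}')
```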

Paper Structure (31 sections, 12 figures, 21 tables)

Introduction
Method
SAM and removal of prompts
Interactive segmentation to automated segmentation
Variables in task-specific supervised training
Image Encoder Architecture
Model Component
Vanilla or Parameter-efficient Fine-tuning Method
Summary
Task-general Prompt-based Learning
Task-agnostic Self-supervised Learning
Datasets
Experiments
Which fine-tuning combination is the best if applying task-specific supervised training only?
Implementation details
...and 16 more sections

Figures (12)

Figure 1: Overview of general fine-tuning strategies based on different levels of dataset availability.
Figure 2: Overview of SAM's architecture and the task-specific fine-tuning architectures selected in our study: 3 encoder architectures $\times$ 2 model updating ranges $\times$ 3 vanilla/PEFT methods = 18 choices.
Figure 3: A visual summary of the collected datasets, including the visual example from each dataset and the number of images under different criteria.
Figure 4: Left: Illustration of the dataset and training setting: task-specific supervised training; Middle: Performance of various task-specific supervised training strategies, varying the Image Encoder depth: 4 (ViT-T), 12 (ViT-B), or 32 (ViT-H); the update method: updating all parameters (V), PEFT-Adapter (A), or PEFT-LoRA (L); and the update scope: fine-tuning the Mask Decoder only (De) or the Image Encoder + Mask Decoder (En/De), compared against UNet and its variants. Right: Difference between the best fine-tuned SAM model (ViT-H + En/Decoder + LoRA) and the best UNet-based model (nnUNet) across individual datasets.
Figure 5: Validation Dice Coefficient Score (DSC) with respect to training epoch for different combinations of fine-tuning strategies. We select one representative dataset from each modality for visualization, and the behavior is consistent across different datasets. Note that the total number of epochs differs between runs because training stops when the validation DSC has not improved for 20 epochs.
...and 7 more figures
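
The Figure 5 caption refers to the validation DSC and to a stopping rule with a patience of 20 epochs. The following is a minimal sketch of both, assuming binary masks and a plain patience counter; the function names `dice_score` and `stopping_epoch` and the smoothing term `eps` are assumptions for illustration, not the authors' code.

```python
import numpy as np


def dice_score(pred, target, eps=1e-8):
    """Dice coefficient 2|P ∩ T| / (|P| + |T|) for two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    intersection = np.logical_and(pred, target).sum()
    # eps avoids division by zero when both masks are empty.
    return (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)


def stopping_epoch(val_scores, patience=20):
    """Return the 1-based epoch at which training stops, or None if it never does.

    Training stops once the validation score has failed to improve
    for `patience` consecutive epochs.
    """
    best = float("-inf")
    epochs_without_improvement = 0
    for epoch, score in enumerate(val_scores, start=1):
        if score > best:
            best = score
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None


pred = [[1, 1], [0, 0]]
target = [[1, 0], [0, 0]]
example_dsc = dice_score(pred, target)  # 2 * 1 / (2 + 1)
```

Because the stopping epoch depends on when each run last improved, runs under this rule naturally end after different numbers of epochs.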
