How to Efficiently Adapt Large Segmentation Model(SAM) to Medical Images

Xinrong Hu, Xiaowei Xu, Yiyu Shi

TL;DR

This work tackles the domain gap between natural and medical images by efficiently adapting Segment Anything (SAM) to medical segmentation. By freezing the SAM encoder and introducing prompt-free prediction heads, in particular the ViT-based AutoSAM and a CNN head, the authors demonstrate strong few-shot performance on MRI segmentation, outperforming training-from-scratch and self-supervised baselines. They show that AutoSAM can generate multi-class masks without prompts and that both the choice of prediction head and the encoder size influence results, with AutoSAM and the CNN head excelling in label-scarce settings. The findings advocate a practical, prompt-free adaptation of SAM as a foundation model for medical imaging, while outlining directions for broader validation and more advanced head designs.

Abstract

The emerging large-scale segmentation model, Segment Anything (SAM), exhibits impressive zero-shot segmentation capabilities on natural images. However, when applied to medical images, SAM suffers from a noticeable performance drop. To make SAM a real "foundation model" for the computer vision community, it is critical to find an efficient way to customize SAM for medical image datasets. In this work, we propose to freeze the SAM encoder and finetune a lightweight task-specific prediction head, as most of the weights in SAM are contributed by the encoder. In addition, SAM is a promptable model, yet prompts are not necessarily available in all application cases, and providing precise prompts for multi-class segmentation is time-consuming. Therefore, we explore three types of prompt-free prediction heads in this work: ViT, CNN, and linear layers. For the ViT head, we remove the prompt tokens in the mask decoder of SAM and name the result AutoSAM. After this modification, AutoSAM can also generate masks for different classes in a single inference. To evaluate the label efficiency of our finetuning method, we compare the results of these three prediction heads on a public medical image segmentation dataset with limited labeled data. Experiments demonstrate that finetuning SAM significantly improves its performance on the medical image dataset, even with just one labeled volume. Moreover, the AutoSAM and CNN prediction heads also achieve better segmentation accuracy than training from scratch and self-supervised learning approaches when annotations are scarce.
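The frozen-encoder recipe described in the abstract can be sketched roughly as follows in PyTorch. This is a minimal illustration, not the authors' implementation: `StandInEncoder` is a hypothetical placeholder for SAM's pretrained image encoder, `CNNHead` is an illustrative prompt-free head, and the embedding width, input size, and class count are all assumptions chosen to keep the example small.

```python
import torch
from torch import nn


class StandInEncoder(nn.Module):
    """Hypothetical placeholder for SAM's pretrained image encoder (not the real ViT)."""

    def __init__(self, embed_dim: int = 256):
        super().__init__()
        # 16x16 patch embedding: a 64x64 input becomes a 4x4 grid of embeddings.
        self.patch_embed = nn.Conv2d(3, embed_dim, kernel_size=16, stride=16)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.patch_embed(x)


class CNNHead(nn.Module):
    """Illustrative prompt-free head: maps image embeddings to per-class logits."""

    def __init__(self, embed_dim: int = 256, num_classes: int = 4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(embed_dim, 64, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            # Undo the 16x downsampling of the stand-in encoder.
            nn.Upsample(scale_factor=16, mode="bilinear", align_corners=False),
            nn.Conv2d(64, num_classes, kernel_size=1),
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return self.net(z)


def build_model(num_classes: int = 4):
    """Return a frozen encoder and a trainable head (class count is an assumption)."""
    encoder = StandInEncoder()
    for p in encoder.parameters():
        p.requires_grad = False  # freeze: the encoder receives no gradient updates
    encoder.eval()
    head = CNNHead(num_classes=num_classes)
    return encoder, head


def train_step(encoder, head, optimizer, images, labels) -> float:
    """One optimization step that updates only the head's parameters."""
    with torch.no_grad():  # no graph is built for the frozen encoder
        z = encoder(images)
    logits = head(z)
    loss = nn.functional.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In this sketch the optimizer would be constructed from `head.parameters()` only, so the encoder weights stay at their pretrained values and the cost of each step is dominated by the small head.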

Paper Structure (15 sections, 5 figures, 3 tables)

This paper contains 15 sections, 5 figures, 3 tables.

Figures (5)

  • Figure 1: t-SNE plot of embeddings produced by SAM's image encoder for four datasets: Synapse [multi-atlas], ACDC [bernard2018deep], ADE20K [zhou2017scene], and COCO [lin2014microsoft]. As shown, there is an apparent domain shift from natural images to medical images in the latent space. This may explain why SAM performs poorly on unseen medical image datasets.
  • Figure 2: Comparison of the SAM inference process and our SAM finetuning process. We freeze the weights of the SAM encoder and add various prediction heads that generate segmentation masks without prompts, including a Vision Transformer (ViT), a CNN, and a linear layer. Our model can also generate masks for different target objects.
  • Figure 3: Illustration of the mask decoder in SAM and AutoSAM. AutoSAM removes the prompt token so that it requires no input prompt, from which the name "Auto" comes. To enable simultaneous multi-class segmentation, AutoSAM duplicates the pair of auxiliary embeddings and the image embedding once per class. Parallel computation can reduce the overhead associated with the duplicated embeddings. The two-way attention includes self-attention blocks and cross-attention blocks.
  • Figure 4: Visualization of predicted masks on the ACDC dataset using different methods. SAM(box) is a zero-shot approach with only box-style prompts, and the prompts for the three different classes are given at the same time. "UNet", "Encoder + CNN", and "AutoSAM" are trained with only one labeled volume.
  • Figure 5: Change in Dice score with respect to the amount of labeled data used in training. The results of UNet, UNet + SimCLR, and AutoSAM are included. Best viewed in color.
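
For readers unfamiliar with the metric in Figure 5, the Dice score for one class is 2|A ∩ B| / (|A| + |B|), where A and B are the predicted and ground-truth pixel sets for that class. The following is a minimal NumPy sketch under stated assumptions: the function name `dice_score` is illustrative, and returning 1.0 when the class is absent from both masks is a convention chosen here, not something the paper specifies.

```python
import numpy as np


def dice_score(pred: np.ndarray, target: np.ndarray, label: int) -> float:
    """Dice overlap for a single class label between two integer label maps."""
    p = pred == label
    t = target == label
    intersection = np.logical_and(p, t).sum()
    denom = p.sum() + t.sum()
    if denom == 0:
        # Assumed convention: a class absent from both masks counts as a perfect match.
        return 1.0
    return float(2.0 * intersection / denom)
```

A multi-class result would typically be summarized by averaging this value over the foreground labels, although the exact averaging used for the figure is not stated in this excerpt.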