AutoSAM: Adapting SAM to Medical Images by Overloading the Prompt Encoder

Tal Shaharabany, Aviad Dahan, Raja Giryes, Lior Wolf

TL;DR

This work addresses SAM's limited performance on medical images by introducing AutoSAM, which replaces SAM's prompt encoder with a trainable image-conditioned encoder $g$ while keeping SAM's image encoder and mask decoder frozen. The learned prompt $Z_I = g(I)$ is optimized with a segmentation loss $L_{seg}(I) = L_{BCE}(I,Z_I,M) + L_{dice}(I,Z_I,M)$, where $M$ is the ground-truth mask; the gradients of this loss propagate through the frozen SAM to train $g$. A lightweight surrogate decoder $h$ maps $g(I)$ to a mask, which allows inspecting what the prompt encodes and doubles as a lightweight segmentation model. Empirically, AutoSAM reports state-of-the-art segmentation on MoNuSeg, GlaS, multiple polyp datasets, and the SUN-SEG video benchmark without fine-tuning SAM, showing strong OOD generalization and fully automatic operation. The results highlight the importance of the conditioning signal and suggest a path toward a universal AutoSAM: a single $g$ that generalizes across medical imaging domains.
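The loss above can be illustrated with a minimal sketch. It assumes masks are flattened lists of per-pixel probabilities and uses a common soft-Dice formulation; the exact Dice variant, the `eps` smoothing term, and the function names `bce_loss`, `dice_loss`, and `seg_loss` are assumptions for illustration, not details taken from the paper.

```python
import math

def bce_loss(pred, target, eps=1e-7):
    """Mean binary cross-entropy over flattened pixels."""
    total = 0.0
    for p, m in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= m * math.log(p) + (1 - m) * math.log(1.0 - p)
    return total / len(pred)

def dice_loss(pred, target, eps=1e-7):
    """Soft Dice loss (assumed form): 1 - 2*sum(P*M) / (sum(P) + sum(M))."""
    inter = sum(p * m for p, m in zip(pred, target))
    denom = sum(pred) + sum(target)
    return 1.0 - (2.0 * inter + eps) / (denom + eps)

def seg_loss(pred, target):
    """L_seg = L_BCE + L_dice, with both terms weighted equally."""
    return bce_loss(pred, target) + dice_loss(pred, target)

mask = [1, 0, 1, 0]                      # toy ground-truth mask M
perfect = seg_loss([1.0, 0.0, 1.0, 0.0], mask)
uncertain = seg_loss([0.5, 0.5, 0.5, 0.5], mask)
```

In this toy case a perfect prediction gives a loss near zero, while a uniform 0.5 prediction is penalized by both terms: the BCE term reacts to per-pixel confidence and the Dice term to overall overlap.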

Abstract

The recently introduced Segment Anything Model (SAM) combines a clever architecture and large quantities of training data to obtain remarkable image segmentation capabilities. However, it fails to reproduce such results for Out-Of-Distribution (OOD) domains such as medical images. Moreover, while SAM is conditioned on either a mask or a set of points, it may be desirable to have a fully automatic solution. In this work, we replace SAM's conditioning with an encoder that operates on the same input image. By adding this encoder and without further fine-tuning SAM, we obtain state-of-the-art results on multiple medical images and video benchmarks. This new encoder is trained via gradients provided by a frozen SAM. For inspecting the knowledge within it, and providing a lightweight segmentation solution, we also learn to decode it into a mask by a shallow deconvolution network.
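The idea of training the new encoder "via gradients provided by a frozen SAM" can be sketched with a deliberately tiny scalar analogue. Everything here is a hypothetical stand-in: `frozen_decoder` plays the role of the frozen network, `g(x) = a*x + b` plays the role of the trainable prompt encoder, and only a BCE term is used for brevity. It is not SAM and not the authors' implementation; it only shows that the loss gradient can flow through fixed weights to update the prompt encoder alone.

```python
import math

# Frozen "SAM-like" weights: constants that are never updated (toy stand-in).
W_IMG, W_PROMPT = 0.5, 2.0

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def frozen_decoder(x, z):
    """Frozen model: maps an image feature x and a prompt z to a mask logit."""
    return W_IMG * x + W_PROMPT * z

def bce(p, m, eps=1e-7):
    p = min(max(p, eps), 1.0 - eps)
    return -(m * math.log(p) + (1 - m) * math.log(1.0 - p))

def train_prompt_encoder(xs, ms, lr=0.5, steps=200):
    """Fit g(x) = a*x + b with gradients that pass through the frozen decoder."""
    a, b = 0.0, 0.0
    history = []
    n = len(xs)
    for _ in range(steps):
        grad_a = grad_b = loss = 0.0
        for x, m in zip(xs, ms):
            z = a * x + b                      # prompt from the trainable encoder
            p = sigmoid(frozen_decoder(x, z))  # prediction of the frozen model
            loss += bce(p, m) / n
            dz = (p - m) * W_PROMPT            # dL/dz by the chain rule
            grad_a += dz * x / n               # dz/da = x
            grad_b += dz / n                   # dz/db = 1
        history.append(loss)
        a -= lr * grad_a                       # only the encoder parameters move
        b -= lr * grad_b
    return a, b, history

xs = [0.0, 0.1, 0.2, 0.3, 0.7, 0.8, 0.9, 1.0]  # toy "pixels"
ms = [0, 0, 0, 0, 1, 1, 1, 1]                  # toy ground-truth labels
a, b, history = train_prompt_encoder(xs, ms)
```

The design point mirrored here is that the frozen weights appear in the gradient (the factor `W_PROMPT`) but are never written to, so all adaptation to the new data is absorbed by the prompt encoder.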

Paper Structure (11 sections, 4 equations, 5 figures, 3 tables)

Figures (5)

  • Figure 1: An example of segmenting an image from the GlaS dataset. (a) The input image. (b) The ground-truth mask. (c) SAM's result with the ground-truth mask provided to its mask encoder. (d) A point-based prompt. (e) SAM's result based on the point prompt. (f) Our result, where the input image itself is given as a prompt to the prompt encoder we train.
  • Figure 2: An illustration of AutoSAM. SAM's prompt encoder is replaced with our custom encoder while the image encoder and mask decoder are frozen.
  • Figure 3: Sample results of the proposed method on the nucleus challenge dataset (MoNuSeg, rows 1-2), the gland segmentation dataset (GlaS, rows 3-4), and the Kvasir polyp segmentation dataset (rows 5-6). (a) Input image. (b) Ground-truth segmentation. (c) The final segmentation map $M_z$. (d) Output of SAM with our mask as input to the mask prompt encoder. (e) Output of SAM with the ground-truth mask as input to the same prompt encoder.
  • Figure 4: The results of the lightweight decoder $h$ on sample test images. The first row shows the input image $I$, the second row shows $h(g(I))$, which is the segmentation mask obtained with the surrogate decoder $h$, the third depicts the results of AutoSAM using the same $g(I)$, and the last row shows the ground-truth segmentation mask $M$.
  • Figure 5: A visual comparison of our solution to MedAdapterSAM [wu2023medical] on the GlaS and MoNuSeg datasets. (a) Input image. (b) Ground-truth mask. (c) Our solution. (d) MedAdapterSAM [wu2023medical] output.