AutoSAM: Adapting SAM to Medical Images by Overloading the Prompt Encoder
Tal Shaharabany, Aviad Dahan, Raja Giryes, Lior Wolf
TL;DR
This work addresses SAM's limited performance on medical images by introducing AutoSAM, which replaces SAM's prompt encoder with a trainable image-conditioned encoder $g$ while keeping the main SAM network frozen. The learned prompt $Z_I = g(I)$ is optimized with a segmentation loss $L_{seg}(I) = L_{BCE}(I,Z_I,M) + L_{dice}(I,Z_I,M)$, where $M$ is the ground-truth mask; the gradients of this loss propagate through the frozen SAM to train $g$. A lightweight surrogate decoder $h$ maps $g(I)$ to a mask for interpretability. Empirically, AutoSAM delivers state-of-the-art segmentation on MoNuSeg, GlaS, multiple polyp datasets, and the SUN-SEG video benchmark without fine-tuning SAM, demonstrating strong out-of-distribution (OOD) generalization and fully automatic operation. The results highlight the critical role of the conditioning signal and suggest a path toward a universal AutoSAM: a single $g$ that generalizes across medical imaging domains, which future work may pursue.
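To make the loss $L_{seg} = L_{BCE} + L_{dice}$ concrete, the following is a minimal sketch in plain Python that operates on flattened lists of predicted foreground probabilities and binary ground-truth labels. The function names, the clamping constant `eps`, and the `smooth` term in the Dice loss are assumptions made for illustration; the source does not state how either term is regularized or reduced.

```python
import math

def bce_loss(probs, targets, eps=1e-7):
    """Mean binary cross-entropy over flattened pixels."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0) (assumed)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(probs)

def dice_loss(probs, targets, smooth=1.0):
    """Soft Dice loss: 1 - (2*overlap + smooth) / (sum(P) + sum(M) + smooth).

    The smoothing term is an assumption; it keeps the ratio defined
    when both the prediction and the mask are empty.
    """
    overlap = sum(p * t for p, t in zip(probs, targets))
    denom = sum(probs) + sum(targets)
    return 1.0 - (2.0 * overlap + smooth) / (denom + smooth)

def seg_loss(probs, targets):
    """Sum of the two terms, mirroring L_seg = L_BCE + L_dice."""
    return bce_loss(probs, targets) + dice_loss(probs, targets)
```

A prediction that matches the mask exactly drives both terms close to zero, while an uninformative prediction of 0.5 everywhere is penalized by both.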
Abstract
The recently introduced Segment Anything Model (SAM) combines a clever architecture and large quantities of training data to obtain remarkable image segmentation capabilities. However, it fails to reproduce such results for Out-Of-Distribution (OOD) domains such as medical images. Moreover, while SAM is conditioned on either a mask or a set of points, it may be desirable to have a fully automatic solution. In this work, we replace SAM's conditioning with an encoder that operates on the same input image. By adding this encoder and without further fine-tuning SAM, we obtain state-of-the-art results on multiple medical image and video benchmarks. This new encoder is trained via gradients provided by a frozen SAM. To inspect the knowledge within it, and to provide a lightweight segmentation solution, we also learn to decode its output into a mask with a shallow deconvolution network.
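The idea of training the new encoder "via gradients provided by a frozen SAM" can be illustrated with a deliberately tiny scalar toy, which is not the authors' implementation. Here a hypothetical one-parameter encoder $g(x) = a \cdot x$ feeds a frozen one-weight "model" (a sigmoid with fixed weight `w_frozen`), and only `a` is updated by backpropagating a binary cross-entropy loss through the frozen weight. All names, the learning rate, and the choice of a plain BCE loss are assumptions.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_prompt_encoder(samples, w_frozen=2.0, a=0.0, lr=0.5, steps=200):
    """Fit the scalar toy encoder g(x) = a * x through a frozen model.

    samples: list of (x, m) pairs; x is a scalar input feature, m is 0 or 1.
    w_frozen: weight of the frozen downstream model; it is never updated.
    Returns the learned a and the mean BCE loss recorded at each step.
    """
    history = []
    n = len(samples)
    for _ in range(steps):
        grad_a, loss = 0.0, 0.0
        for x, m in samples:
            z = a * x                   # toy stand-in for the prompt Z_I = g(I)
            p = sigmoid(w_frozen * z)   # frozen model's prediction
            p_c = min(max(p, 1e-7), 1.0 - 1e-7)
            loss += -(m * math.log(p_c) + (1 - m) * math.log(1 - p_c))
            # Chain rule: dL/da = dL/dlogit * dlogit/dz * dz/da
            grad_a += (p - m) * w_frozen * x
        history.append(loss / n)
        a -= lr * grad_a / n            # only the encoder parameter moves
    return a, history
```

The gradient reaches `a` only by passing through `w_frozen`, which is the same dependency structure as training $g$ through a frozen SAM, at toy scale.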
