S-SAM: SVD-based Fine-Tuning of Segment Anything Model for Medical Image Segmentation
Jay N. Paranjape, Shameema Sikder, S. Swaroop Vedula, Vishal M. Patel
TL;DR
This work makes foundation-model segmentation more practical for medical images by reducing both expert intervention and training cost. It introduces S-SAM, an SVD-based fine-tuning approach that updates only the singular values of the weights in SAM's image encoder and uses label names as prompts. Given the decomposition $W=U\Sigma V^{T}$, each weight is replaced by $W \leftarrow U\,\mathrm{ReLU}(A\odot\Sigma+B)V^{T}$, where $A$ and $B$ scale and shift the singular values. The method adds a Text Affine Layer and learnable positional embeddings while keeping CLIP, the prompt encoder, and the mask decoder frozen. It achieves state-of-the-art performance on five modalities while training a number of parameters equal to about 0.4% of SAM's parameters. The result is data-efficient segmentation suited to clinical workflows, with substantial efficiency gains over existing SAM adaptations and robust cross-modality performance. The authors release their code and identify class-size disparities as an area for future improvement.
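To make the transform in the TL;DR concrete, the following is a minimal NumPy sketch of the singular-value update for a single weight matrix. It is not the authors' implementation. It assumes that $A$ and $B$ are vectors with one entry per singular value, and that they start at $A=1$, $B=0$ so that the adapted weight initially equals the pretrained one. The function names `svd_factorize` and `adapted_weight` and the 6 x 4 matrix size are hypothetical, chosen only for illustration.

```python
import numpy as np


def svd_factorize(W):
    """Decompose W = U @ diag(S) @ Vt once; U, S and Vt are then kept frozen."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    return U, S, Vt


def adapted_weight(U, S, Vt, A, B):
    """Rebuild the weight as U ReLU(A * S + B) Vt.

    Assumption: A (scale) and B (shift) are vectors with one entry per
    singular value, and they are the only trainable quantities here.
    """
    new_S = np.maximum(A * S + B, 0.0)  # ReLU keeps singular values non-negative
    return (U * new_S) @ Vt             # equivalent to U @ diag(new_S) @ Vt


# Hypothetical stand-in for one pretrained weight matrix.
rng = np.random.default_rng(0)
W = rng.standard_normal((6, 4))
U, S, Vt = svd_factorize(W)

# Assumed initialization: identity scale and zero shift reproduce W.
A = np.ones_like(S)
B = np.zeros_like(S)
W_init = adapted_weight(U, S, Vt, A, B)

# Trainable entries for this matrix versus full fine-tuning.
trainable = A.size + B.size
print(trainable, "trainable values instead of", W.size)
```

Under these assumptions, a matrix of shape $m \times n$ contributes $2\min(m,n)$ trainable values rather than $mn$, which illustrates why tuning only singular values keeps the trainable parameter count small.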
Abstract
Medical image segmentation has traditionally been approached by training or fine-tuning an entire model for each new modality or dataset, which requires tuning a large number of parameters. Since the introduction of the Segment Anything Model (SAM) for prompted segmentation of natural images, many efforts have been made to adapt it efficiently for medical imaging, thereby reducing training time and resources. However, these methods still require expert annotations for every image, in the form of point prompts or bounding box prompts, during both training and inference, which makes them tedious to use in practice. In this paper, we propose an adaptation technique, called S-SAM, that trains a number of parameters equal to only 0.4% of SAM's parameters and uses only the label names as prompts to produce precise masks. This not only makes tuning SAM more efficient than existing adaptation methods but also removes the burden of providing expert prompts. We evaluate S-SAM on five different modalities: endoscopic images, X-ray, ultrasound, CT, and histology images. Our experiments show that S-SAM outperforms state-of-the-art methods as well as existing SAM adaptation methods while tuning significantly fewer parameters. We release the code for S-SAM at https://github.com/JayParanjape/SVDSAM.
