S-SAM: SVD-based Fine-Tuning of Segment Anything Model for Medical Image Segmentation

Jay N. Paranjape, Shameema Sikder, S. Swaroop Vedula, Vishal M. Patel

TL;DR

This work addresses the practicality of applying foundation-model segmentation to medical images by reducing expert intervention and training cost. It introduces S-SAM, an SVD-based fine-tuning approach that updates the singular values of the weights in SAM's image encoder and uses label names as prompts. Each encoder weight is factored as $W=U\Sigma V^{T}$ and replaced via the transform $W \leftarrow U\,\mathrm{ReLU}(A\odot\Sigma+B)\,V^{T}$. The method adds a Text Affine Layer and learnable positional embeddings, keeps CLIP, the prompt encoder, and the mask decoder frozen, and achieves state-of-the-art performance on five modalities while training parameters equal to about 0.4% of SAM's parameters. This yields practical, data-efficient segmentation suited to clinical workflows, and the authors release their code for broader adoption. The work reports significant efficiency gains over existing SAM adaptations and robust cross-modality performance, while acknowledging class-size disparities as an area for future improvement.
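The transform in the TL;DR can be illustrated on a single small matrix. The following is a minimal NumPy sketch, not the authors' implementation: the function names `svd_factor` and `ssam_weight`, the 6x4 matrix size, and the initialization of the scale to ones and the shift to zeros are all assumptions made for illustration.

```python
import numpy as np

def svd_factor(W):
    """Factor a frozen weight W into U, its singular values, and V^T (done once)."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return U, s, Vt

def ssam_weight(U, s, Vt, a, b):
    """Rebuild the weight as U ReLU(a * s + b) V^T.

    U, s, and Vt stay fixed; only the vectors a (scale) and b (shift)
    would be updated during fine-tuning.
    """
    s_new = np.maximum(a * s + b, 0.0)  # ReLU keeps singular values non-negative
    return (U * s_new) @ Vt             # scales the columns of U, then applies V^T

rng = np.random.default_rng(0)
W = rng.standard_normal((6, 4))         # stand-in for one encoder weight (assumed size)
U, s, Vt = svd_factor(W)

# Assumed initialization: a = 1 and b = 0 reproduce the pre-trained weight exactly.
a = np.ones_like(s)
b = np.zeros_like(s)
W_init = ssam_weight(U, s, Vt, a, b)
```

With this initialization the rebuilt weight equals the original, so training would start from the pre-trained model; here only `a.size + b.size = 8` values would be trainable, against 24 entries in `W`.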

Abstract

Medical image segmentation has traditionally been approached by training or fine-tuning an entire model for each new modality or dataset. However, this approach often requires tuning a large number of parameters during training. With the introduction of the Segment Anything Model (SAM) for prompted segmentation of natural images, many efforts have been made towards adapting it efficiently for medical imaging, thus reducing training time and resources. However, these methods still require expert annotations for every image in the form of point prompts or bounding box prompts during training and inference, making them tedious to employ in practice. In this paper, we propose an adaptation technique, called S-SAM, that trains parameters equal to only 0.4% of SAM's parameters and uses only the label names as prompts for producing precise masks. This not only makes tuning SAM more efficient than the existing adaptation methods but also removes the burden of providing expert prompts. We evaluate S-SAM on five different modalities: endoscopic images, X-ray, ultrasound, CT, and histology images. Our experiments show that S-SAM outperforms state-of-the-art methods as well as existing SAM adaptation methods while tuning significantly fewer parameters. We release the code for S-SAM at https://github.com/JayParanjape/SVDSAM.

Paper Structure

This paper contains 6 sections, 2 equations, 4 figures, and 4 tables.

Figures (4)

  • Figure 1: S-SAM Architecture. The image encoder weights are modified by performing a transformation over their singular values. Other trainable parameters include the layernorms and positional embeddings in the encoder and the Text Affine Layer (TAL). Everything else is frozen and initialized with SAM's pre-trained checkpoint.
  • Figure 2: Comparison of different fine-tuning methods. (a) Naive fine-tuning. (b) LoRA. (c) Our approach, which tunes only the singular values and is even more efficient than LoRA.
  • Figure 3: A qualitative comparison among different methods. From the top, the rows represent CholecSeg8k, Ultrasound, ChestXDet, LiTS, and GLAS, respectively. The green dot in the last column denotes the point prompt used to query SAM.
  • Figure 4: A comparison among different methods based on the number of parameters trained. The red bars indicate traditional DL-based segmentation methods. Blue bars indicate SAM-based methods and green bars indicate our method. The numbers to the right of each bar denote the number of trainable parameters.
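The comparison in Figures 2 and 4 can be made concrete with a rough per-matrix parameter count. This is a back-of-the-envelope sketch under stated assumptions, not the paper's accounting: the 768x768 weight shape and the LoRA rank of 4 are illustrative choices, the function names are hypothetical, and the count for the singular-value approach assumes one scale and one shift per singular value. The other trainable parts named in Figure 1 (layernorms, positional embeddings, and the Text Affine Layer) are not counted here.

```python
def full_finetune_params(d, k):
    """Naive fine-tuning updates every entry of a d x k weight."""
    return d * k

def lora_params(d, k, r):
    """LoRA trains two low-rank factors of shapes d x r and r x k."""
    return r * (d + k)

def singular_value_params(d, k):
    """Assumed here: one scale and one shift per singular value."""
    return 2 * min(d, k)

d, k, r = 768, 768, 4  # illustrative shape and rank, not taken from the paper

counts = {
    "full": full_finetune_params(d, k),
    "lora": lora_params(d, k, r),
    "singular_values": singular_value_params(d, k),
}

for name, n in counts.items():
    print(f"{name:>16}: {n:>7} trainable parameters")
```

For this shape and rank, the singular-value count is a quarter of the LoRA count and well under 1% of the full count. The gap to LoRA narrows as the rank decreases.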