
Prompt-Tuning SAM: From Generalist to Specialist with only 2048 Parameters and 16 Training Images

Tristan Piater, Björn Barz, Alexander Freytag

TL;DR

PTSAM addresses the domain shift and prompt-dependence limitations of SAM for microscopy and medical image segmentation by applying visual prompt tuning to both the mask decoder and the image encoder. By freezing the core SAM weights and training only $2048$ (MD) plus optionally $73728$ (IE) parameters, it delivers competitive or superior segmentation accuracy with as few as $16$ annotated images, and shows robustness under few-shot conditions. The method bridges the gap between generalist segmentation and domain-specific automation, outperforming or matching state-of-the-art SAM adaptations while dramatically reducing trainable parameters. This makes PTSAM a practical, out-of-the-box solution for automated, domain-specific segmentation tasks with limited data and resources.

Abstract

The Segment Anything Model (SAM) is widely used for segmenting a diverse range of objects in natural images from simple user prompts like points or bounding boxes. However, SAM's performance decreases substantially when applied to non-natural domains like microscopic imaging. Furthermore, due to SAM's interactive design, it requires a precise prompt for each image and object, which is infeasible in many automated biomedical applications. Previous solutions adapt SAM by training millions of parameters, either by fine-tuning large parts of the model or by training adapter layers. In contrast, we show that as few as 2,048 additional parameters are sufficient for turning SAM into a use-case specialist for a certain downstream task. Our novel PTSAM (prompt-tuned SAM) method uses prompt-tuning, a parameter-efficient fine-tuning technique, to adapt SAM for a specific task. We validate the performance of our approach on multiple microscopy datasets and one medical dataset. Our results show that prompt-tuning only SAM's mask decoder already leads to performance on par with state-of-the-art techniques while requiring roughly 2,000x fewer trainable parameters. For addressing domain gaps, we find that additionally prompt-tuning SAM's image encoder is beneficial, further improving segmentation accuracy by up to 18% over state-of-the-art results. Since PTSAM can be reliably trained with as few as 16 annotated images, we find it particularly helpful for applications with limited training data and domain shifts.
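The parameter counts quoted above can be reproduced with simple arithmetic. The sketch below is one decomposition that matches the numbers, not a statement of the authors' configuration. The embedding widths (256 for the mask decoder, 768 for the image encoder), the 12 encoder layers (a ViT-B backbone), and the use of 8 prompt tokens per encoder layer are all assumptions that this section does not state. Only the counts 2,048 and 73,728, the value $n_\textrm{md} = 8$ from the ablation, and the "roughly 2,000x" ratio come from the text. The helper `prompt_param_count` is a hypothetical name.

```python
# Assumed dimensions (NOT stated in this section):
# - SAM's mask decoder operates on 256-dimensional tokens
# - the image encoder is a ViT-B with 12 layers of width 768
MD_EMBED_DIM = 256
IE_EMBED_DIM = 768
IE_NUM_LAYERS = 12


def prompt_param_count(num_prompts: int, embed_dim: int, num_layers: int = 1) -> int:
    """Trainable scalars for `num_prompts` learnable tokens in each of `num_layers` layers."""
    return num_prompts * embed_dim * num_layers


# Mask decoder (MD): 8 prompt tokens, matching n_md = 8 from the ablation.
md_params = prompt_param_count(num_prompts=8, embed_dim=MD_EMBED_DIM)

# Image encoder (IE): 8 prompt tokens per layer is an assumption chosen
# because it reproduces the quoted total.
ie_params = prompt_param_count(
    num_prompts=8, embed_dim=IE_EMBED_DIM, num_layers=IE_NUM_LAYERS
)

# "Roughly 2,000x fewer" implies baselines with about this many trainable parameters.
implied_baseline = md_params * 2000

print(md_params)         # 2048
print(ie_params)         # 73728
print(implied_baseline)  # 4096000
```

Under these assumptions, the implied baseline of about 4.1 million trainable parameters is consistent with the abstract's remark that previous solutions train millions of parameters.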

Paper Structure

This paper contains 22 sections, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Comparison of Prompt-Tuning SAM (ours) to other techniques for adapting SAM, as well as nnU-Net as the de-facto standard. Circles indicate adapting the mask decoder only, while crosses indicate additional adaptation of the image encoder. Marker sizes correlate with the number of training images.
  • Figure 2: Overview of our PTSAM approach. We remove the prompt encoder from the SAM architecture (Kirillov et al., 2023) and keep the remaining SAM weights frozen. Instead, we add learnable prompt parameters (green) to the mask decoder to make SAM a use-case specialist, and to each layer of the image encoder to address domain shifts.
  • Figure 3: Comparing improvements of different methods when additionally tuning the image encoder. Higher values show that tuning the image encoder is beneficial.
  • Figure 4: Performance decrease of the methods when training with 16 instead of 64 images. Lower values indicate that the method requires more training data.
  • Figure 5: Ablation: investigating the effect of different numbers of prompts in mask decoder (MD) and image encoder (IE). For the blue MD curve, the image encoder is frozen, meaning $n_\textrm{ie} = 0$. For the red IE curve, we use $n_\textrm{md} = 8$. Note the two y-axes scales.
  • ...and 1 more figures
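The idea in the Figure 2 caption, learnable prompt tokens added to each layer of a frozen encoder, can be illustrated with a toy example. This is a minimal sketch in the style of deep visual prompt tuning, where each layer receives its own prompt tokens and the outputs at prompt positions are discarded before the next layer. Whether PTSAM handles prompt outputs exactly this way is an assumption; this section does not say. The names `run_with_deep_prompts` and `mixing_layer` are hypothetical, and the toy layer is only a stand-in for a transformer block, not SAM's encoder.

```python
from typing import Callable, List

Token = List[float]
Layer = Callable[[List[Token]], List[Token]]


def mixing_layer(tokens: List[Token]) -> List[Token]:
    """Toy frozen layer: adds the mean token to every token (stand-in for attention)."""
    dim = len(tokens[0])
    mean = [sum(t[i] for t in tokens) / len(tokens) for i in range(dim)]
    return [[t[i] + mean[i] for i in range(dim)] for t in tokens]


def run_with_deep_prompts(
    patch_tokens: List[Token],
    frozen_layers: List[Layer],
    layer_prompts: List[List[Token]],
) -> List[Token]:
    """Prepend each layer's own prompts, apply the frozen layer, drop prompt outputs."""
    assert len(frozen_layers) == len(layer_prompts)
    tokens = patch_tokens
    for layer, prompts in zip(frozen_layers, layer_prompts):
        extended = prompts + tokens      # prompts are the only trainable values
        out = layer(extended)            # the layer's own weights stay untouched
        tokens = out[len(prompts):]      # keep only the patch-token outputs
    return tokens


patches = [[1.0, 0.0], [0.0, 1.0]]
layers = [mixing_layer, mixing_layer]

zero_prompts = [[[0.0, 0.0]], [[0.0, 0.0]]]    # one prompt token per layer
tuned_prompts = [[[3.0, 3.0]], [[-1.0, 2.0]]]  # arbitrary illustrative values

out_zero = run_with_deep_prompts(patches, layers, zero_prompts)
out_tuned = run_with_deep_prompts(patches, layers, tuned_prompts)

# Count of trainable scalars: prompts per layer * token width, summed over layers.
num_trainable = sum(len(p) * len(p[0]) for p in tuned_prompts)
```

The output keeps one token per input patch regardless of how many prompts are used, while changing the prompt values changes the patch outputs. That is the mechanism by which a small set of trainable tokens can steer a frozen network.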