Med-PerSAM: One-Shot Visual Prompt Tuning for Personalized Segment Anything Model in Medical Domain

Hangyul Yoon, Doohyuk Jang, Jungeun Kim, Eunho Yang

TL;DR

This work introduces Med-PerSAM, a novel and straightforward one-shot framework designed for the medical domain that outperforms various foundational models and previous SAM-based approaches across diverse 2D medical imaging datasets.

Abstract

Leveraging pre-trained models with tailored prompts for in-context learning has proven highly effective in NLP tasks. Building on this success, recent studies have applied a similar approach to the Segment Anything Model (SAM) within a "one-shot" framework, where only a single reference image and its label are employed. However, these methods face limitations in the medical domain, primarily due to SAM's essential requirement for visual prompts and the over-reliance on pixel similarity for generating them. This dependency may lead to (1) inaccurate prompt generation and (2) clustering of point prompts, resulting in suboptimal outcomes. To address these challenges, we introduce Med-PerSAM, a novel and straightforward one-shot framework designed for the medical domain. Med-PerSAM uses only visual prompt engineering and eliminates the need for additional training of the pretrained SAM or human intervention, owing to our novel automated prompt generation process. By integrating our lightweight warping-based prompt tuning model with SAM, we enable the extraction and iterative refinement of visual prompts, enhancing the performance of the pre-trained SAM. This advancement is particularly meaningful in the medical domain, where creating visual prompts poses notable challenges for individuals lacking medical expertise. Our model outperforms various foundational models and previous SAM-based approaches across diverse 2D medical imaging datasets.

Paper Structure

This paper contains 38 sections, 10 equations, 13 figures, 16 tables.

Figures (13)

  • Figure 1: Comparison of point prompts from PerSAM (Zhang et al., 2023), Matcher (Liu et al., 2023), and our method. Ground truth masks are shown in yellow, with positive and negative point prompts depicted in blue and red, respectively. Unlike the two SAM-based methods, which show over-clustering (pink arrows) or struggle to differentiate organs with similar pixel intensities (sky-blue boxes), our approach accurately generates prompts and also performs well in challenging areas (lime green arrows). Additional examples can be found in the Appendix.
  • Figure 2: Overall Framework of Med-PerSAM. (1) Initially, the warping model is trained with warping ($\mathcal{L}_\text{warp}$) and augmentation ($\mathcal{L}_\text{aug}$) losses, and employs optical flow to generate a warped mask. (2) This serves as a mask prompt for SAM, while point and box prompts are extracted from the modules $g_{\texttt{point}}$ and $g_{\texttt{box}}$, respectively. (3) The resulting output is used to update the visual prompts and SAM output, (4) and the refined prediction mask is again utilized to retrain the warping model, which enhances the quality of the warped mask.
  • Figure 3: Overview of our point prompting strategy. (1) A prototype vector is defined as the average of foreground feature vectors from the reference image. (2) Erosion and dilation kernels are used to identify candidate regions for point prompts, which are then divided into subregions. (3) In each subregion, positive and negative point prompts are chosen based on the offsets that display the highest and lowest similarity to the class prototype vector, respectively.
  • Figure 4: Example of prompt refinement. The predicted outcome is indicated in yellow, while the positive and negative point prompts are marked in blue and red, respectively. Additional examples are provided in the Appendix.
  • Figure 5: A qualitative comparison of the results from our model and other baseline models. Additional visualized examples of the main experiment and ablation studies are provided in the Appendix.
  • ...and 8 more figures
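
The point prompting strategy summarized in the Figure 3 caption can be illustrated with a small sketch. This is not the authors' implementation. It assumes per-pixel feature vectors stored as nested lists, a 3x3 erosion/dilation kernel, square grid cells as the "subregions", cosine similarity as the similarity measure, positive candidates taken from the eroded mask, and negative candidates taken from the ring between the dilated mask and the mask itself. All function names (`prototype`, `erode`, `dilate`, `select_points`) and the `cell` parameter are hypothetical.

```python
import math


def prototype(features, mask):
    """Average the feature vectors at foreground pixels of the reference image."""
    fg = [features[r][c] for r, row in enumerate(mask) for c, m in enumerate(row) if m]
    return [sum(col) / len(fg) for col in zip(*fg)]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def _morph(mask, keep):
    """Apply a 3x3 binary morphology step; out-of-bounds pixels count as 0."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            window = [
                mask[rr][cc] if 0 <= rr < h and 0 <= cc < w else 0
                for rr in (r - 1, r, r + 1)
                for cc in (c - 1, c, c + 1)
            ]
            out[r][c] = int(keep(window))
    return out


def erode(mask):
    return _morph(mask, all)  # pixel survives only if its whole window is foreground


def dilate(mask):
    return _morph(mask, any)  # pixel is set if any pixel in its window is foreground


def select_points(features, mask, proto, cell=4):
    """Pick one positive and one negative point per grid cell (assumed subregion)."""
    inner, outer = erode(mask), dilate(mask)
    h, w = len(mask), len(mask[0])

    def sim(p):
        return cosine(features[p[0]][p[1]], proto)

    pos, neg = [], []
    for r0 in range(0, h, cell):
        for c0 in range(0, w, cell):
            block = [
                (r, c)
                for r in range(r0, min(r0 + cell, h))
                for c in range(c0, min(c0 + cell, w))
            ]
            # Positive candidates: safely inside the mask (assumption).
            pos_cand = [p for p in block if inner[p[0]][p[1]]]
            # Negative candidates: ring just outside the mask (assumption).
            neg_cand = [p for p in block if outer[p[0]][p[1]] and not mask[p[0]][p[1]]]
            if pos_cand:
                pos.append(max(pos_cand, key=sim))  # highest similarity to prototype
            if neg_cand:
                neg.append(min(neg_cand, key=sim))  # lowest similarity to prototype
    return pos, neg
```

In a one-shot setting, `prototype` would be computed from the reference image and its label, while `select_points` would run on the target image's features and a coarse mask estimate such as the warped mask. Selecting one point per cell spreads the prompts across the region, which is one way to avoid the clustering of point prompts noted in the abstract.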