RFMedSAM 2: Automatic Prompt Refinement for Enhanced Volumetric Medical Image Segmentation with SAM 2

Bin Xie, Hao Tang, Yan Yan, Gady Agam

TL;DR

RFMedSAM 2 addresses the limitations of SAM 2 in medical image segmentation by introducing automatic prompt refinement and architectural adapters. With ground-truth prompts, fine-tuning SAM 2 with specialized adapters and frame strategies reaches an upper-bound performance of $DSC=92.30\%$ on BTCV, surpassing nnUNet by $12\%$. To enable practical deployment without ground-truth prompts, RFMedSAM 2 embeds an independent U-Net-based prompt generator and applies dual-stage refinement, delivering state-of-the-art results on AMOS22 ($DSC$ improved by $2.9\%$ over nnUNet) and on BTCV ($+6.4\%$ over nnUNet, against the prompted upper bound of $DSC\approx92.3\%$). The work demonstrates that parameter-efficient adapters, refined memory attention, and autonomous prompt generation can unlock SAM 2’s potential for robust, automated volumetric medical segmentation.
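The DSC values quoted above are Dice Similarity Coefficients, defined for a predicted mask $A$ and a reference mask $B$ as $DSC = 2|A \cap B| / (|A| + |B|)$. The following is a minimal sketch of that definition for one binary 2D mask, using only the Python standard library. The function name `dice_score` and the convention of returning 1.0 when both masks are empty are assumptions made for illustration, not details taken from the paper, which would typically compute the metric per organ class over whole volumes with an array library.

```python
from typing import Sequence


def dice_score(pred: Sequence[Sequence[int]],
               target: Sequence[Sequence[int]]) -> float:
    """Dice Similarity Coefficient 2|A∩B| / (|A| + |B|) for two binary 2D masks.

    `pred` and `target` are same-shaped nested sequences where any truthy
    value counts as foreground.
    """
    intersection = 0
    pred_area = 0
    target_area = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            p_fg, t_fg = bool(p), bool(t)
            pred_area += int(p_fg)
            target_area += int(t_fg)
            intersection += int(p_fg and t_fg)

    denominator = pred_area + target_area
    if denominator == 0:
        # Assumed convention: two empty masks agree perfectly.
        return 1.0
    return 2.0 * intersection / denominator


if __name__ == "__main__":
    prediction = [[1, 1], [0, 0]]
    reference = [[1, 0], [0, 0]]
    print(f"DSC = {dice_score(prediction, reference):.4f}")
```

In this toy case the masks share one foreground pixel out of three total foreground pixels, so the score is $2 \cdot 1 / 3$.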

Abstract

Segment Anything Model 2 (SAM 2), a prompt-driven foundation model extending SAM to both image and video domains, has shown superior zero-shot performance compared to its predecessor. Building on SAM's success in medical image segmentation, SAM 2 presents significant potential for further advancement. However, similar to SAM, SAM 2 is limited by its output of binary masks, inability to infer semantic labels, and dependence on precise prompts for the target object area. Additionally, direct application of SAM and SAM 2 to medical image segmentation tasks yields suboptimal results. In this paper, we explore the upper performance limit of SAM 2 using custom fine-tuning adapters, achieving a Dice Similarity Coefficient (DSC) of 92.30% on the BTCV dataset, surpassing the state-of-the-art nnUNet by 12%. Following this, we address the prompt dependency by investigating various prompt generators. We introduce a UNet to autonomously generate predicted masks and bounding boxes, which serve as input to SAM 2. Subsequent dual-stage refinements by SAM 2 further enhance performance. Extensive experiments show that our method achieves state-of-the-art results on the AMOS2022 dataset, with a Dice improvement of 2.9% compared to nnUNet, and outperforms nnUNet by 6.4% on the BTCV dataset.
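The abstract describes a UNet that produces predicted masks and bounding boxes, which are then passed to SAM 2 as prompts. One plausible way to obtain a box prompt from a predicted mask is to take the tight bounding box of its foreground pixels, sketched below in standard-library Python. This is an illustration under stated assumptions rather than the authors' implementation: the function name `mask_to_box`, the `margin` parameter, and the `(x_min, y_min, x_max, y_max)` corner ordering are all hypothetical choices.

```python
from typing import Optional, Sequence, Tuple

# Hypothetical box layout: (x_min, y_min, x_max, y_max), inclusive pixel indices.
Box = Tuple[int, int, int, int]


def mask_to_box(mask: Sequence[Sequence[int]], margin: int = 0) -> Optional[Box]:
    """Return the tight bounding box of the foreground in a binary 2D mask.

    Returns None when the mask has no foreground, in which case no box
    prompt can be derived for that slice. `margin` (an assumed knob) pads
    the box by that many pixels, clipped to the image bounds.
    """
    rows = [y for y, row in enumerate(mask) if any(row)]
    if not rows:
        return None
    cols = [x for row in mask for x, value in enumerate(row) if value]

    height = len(mask)
    width = len(mask[0])
    x_min = max(min(cols) - margin, 0)
    y_min = max(rows[0] - margin, 0)
    x_max = min(max(cols) + margin, width - 1)
    y_max = min(rows[-1] + margin, height - 1)
    return (x_min, y_min, x_max, y_max)


if __name__ == "__main__":
    predicted_mask = [
        [0, 0, 0, 0],
        [0, 1, 1, 0],
        [0, 0, 1, 0],
        [0, 0, 0, 0],
    ]
    print(mask_to_box(predicted_mask))
    print(mask_to_box(predicted_mask, margin=1))
```

For a multi-organ volume, a function like this would presumably be applied per slice and per class, since SAM 2 outputs binary masks and cannot infer semantic labels on its own.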

Paper Structure

This paper contains 31 sections, 2 equations, 9 figures, 7 tables.

Figures (9)

  • Figure 1: Overview of our proposed RFMedSAM 2.
  • Figure 2: Overview of SAM 2. The pipeline includes steps for processing prompted and unprompted frames.
  • Figure 3: (1) Performance comparisons based on proposed methods. (2) Ablation studies for frame selection strategies. (3) Proposed Adapters. (4) Ablation studies for prompt generators.
  • Figure 4: Qualitative comparison on BTCV dataset. RFMedSAM 2 is the most precise for each class and has fewer segmentation outliers.
  • Figure 5: Comparisons with different output predictions for Step 0, Step 1, and Step 2.
  • ...and 4 more figures