Slide-SAM: Medical SAM Meets Sliding Window

Quan Quan, Fenghe Tang, Zikang Xu, Heqin Zhu, S. Kevin Zhou

TL;DR

This work addresses the challenge of applying a pre-trained 2D segmentation model (SAM) to 3D medical images by introducing Slide-SAM, which uses a three-slice sliding window to predict simultaneous masks across adjacent slices with prompts on the central slice. It preserves SAM’s pre-trained strengths by freezing the backbone and reusing decoder weights, while enabling efficient multi-slice inference through LoRA-based fine-tuning and a hybrid loss that accommodates both 3D labels and SAM-generated 2D pseudo-labels. Empirical results across CHAOS, BTCV, WORD, and MSD datasets show improved 3D segmentation with minimal prompts, enhanced annotation efficiency, and robustness to noisy prompts, highlighting Slide-SAM’s potential to accelerate clinical annotation workflows. The approach combines architectural adaptation, data enrichment, and task-aligned loss to achieve coherent 3D segmentations with practical inference speed and memory usage improvements.

Abstract

The Segment Anything Model (SAM) has achieved notable success in two-dimensional segmentation of natural images. However, the substantial gap between medical and natural images hinders its direct application to medical image segmentation tasks. In 3D medical images in particular, SAM struggles to learn contextual relationships between slices, limiting its practical applicability. Moreover, applying 2D SAM to 3D images requires prompting the entire volume, which is time- and label-consuming. To address these problems, we propose Slide-SAM, which treats a stack of three adjacent slices as a prediction window. It first takes three slices from a 3D volume, together with point or bounding-box prompts on the central slice, as inputs to predict segmentation masks for all three slices. The masks of the top and bottom slices are then used to generate new prompts for adjacent slices. Finally, step-wise prediction is achieved by sliding the prediction window forward or backward through the entire volume. Our model is trained on multiple public and private medical datasets and demonstrates its effectiveness through extensive 3D segmentation experiments using minimal prompts. Code is available at https://github.com/Curli-quan/Slide-SAM.
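
The sliding-window inference described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: `toy_predictor` is a hypothetical stand-in for the model (it simply thresholds intensities inside the box), `box_from_mask` and `slide_segment` are illustrative names, and the choice to re-center each new window on the edge slice of the previous one is an assumption about how the window advances.

```python
import numpy as np

def box_from_mask(mask):
    """Return (x_min, y_min, x_max, y_max) around a binary mask, or None if it is empty."""
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return None
    return (int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max()))

def toy_predictor(window, box):
    """Hypothetical stand-in for the model: threshold all three slices inside the box."""
    x0, y0, x1, y1 = box
    masks = np.zeros(window.shape, dtype=bool)
    masks[:, y0:y1 + 1, x0:x1 + 1] = window[:, y0:y1 + 1, x0:x1 + 1] > 0.5
    return masks

def slide_segment(volume, start, box, predict=toy_predictor):
    """Segment a (depth, height, width) volume from one box prompt on slice `start`."""
    depth = volume.shape[0]
    if depth < 3 or not 1 <= start <= depth - 2:
        raise ValueError("start must have a neighbouring slice on both sides")
    result = np.zeros(volume.shape, dtype=bool)

    # First window: the user prompt sits on the central slice.
    first = predict(volume[start - 1:start + 2], box)
    result[start - 1:start + 2] = first

    # Slide forward (step=+1, edge index 2) and backward (step=-1, edge index 0).
    for step, edge in ((1, 2), (-1, 0)):
        center = start + step
        next_box = box_from_mask(first[edge])
        while next_box is not None and 1 <= center <= depth - 2:
            masks = predict(volume[center - 1:center + 2], next_box)
            result[center + step] = masks[edge]      # keep only the newly reached slice
            next_box = box_from_mask(masks[edge])    # edge mask becomes the next prompt
            center += step
    return result

# Synthetic volume: a bright block spanning slices 2..6.
volume = np.zeros((9, 12, 12))
volume[2:7, 3:7, 4:8] = 1.0
seg = slide_segment(volume, start=4, box=(4, 3, 7, 6))
```

Propagation in each direction stops as soon as the edge mask is empty, so a single prompt on one slice is enough to cover the whole object in this toy setting.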

Paper Structure (20 sections, 4 equations, 8 figures, 6 tables)


Figures (8)

  • Figure 1: The training pipeline of Slide-SAM. First, three adjacent slices are fed into the backbone network. Then, the prompt encoder encodes points or boxes. The mask decoder receives the features generated in the previous steps and produces a mask for each slice using different heads. The hybrid loss is computed only for layers with labels.
  • Figure 2: The inference process of Slide-SAM.
  • Figure 3: Labeling efficiency: the number of images that can be annotated using 1000 prompts on the WORD test set.
  • Figure 4: Visual comparison on the CHAOS dataset.
  • Figure 5: Predictions on the BTCV test set with different noisy prompts. We display 3 slices and their masks in RGB format.
  • ...and 3 more figures
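
The Figure 1 caption notes that the hybrid loss is computed only for layers with labels. A minimal sketch of that masking idea follows, reading "layers" as the three slices of the window, which is an interpretation. The per-slice loss used here (soft Dice plus binary cross-entropy) is an assumption, since the composition of the hybrid loss is not given in this summary, and `slice_loss` and `masked_window_loss` are hypothetical names.

```python
import numpy as np

def slice_loss(prob, target, eps=1e-6):
    """Assumed per-slice loss: soft Dice plus binary cross-entropy."""
    prob = np.clip(prob, eps, 1 - eps)
    inter = (prob * target).sum()
    dice = 1 - (2 * inter + eps) / (prob.sum() + target.sum() + eps)
    bce = -(target * np.log(prob) + (1 - target) * np.log(1 - prob)).mean()
    return dice + bce

def masked_window_loss(probs, targets, has_label):
    """Average the per-slice loss over labelled slices only; unlabelled slices are skipped."""
    losses = [slice_loss(p, t) for p, t, ok in zip(probs, targets, has_label) if ok]
    return float(np.mean(losses)) if losses else 0.0
```

With a full 3D label, `has_label` would be `(True, True, True)`; with a 2D pseudo-label available only on the central slice, it would be `(False, True, False)`, so the two outer heads receive no gradient from that sample.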