Few-Shot Adaptation of Training-Free Foundation Model for 3D Medical Image Segmentation

Xingxin He, Yifan Hu, Zhaoye Zhou, Mohamed Jarraya, Fang Liu

TL;DR

The paper addresses the challenge of medical-image segmentation when large labeled datasets and manual prompts are impractical by introducing FATE-SAM, a training-free, prompt-free adaptation that reuses SAM2 modules with few-shot support to guide 3D segmentation. It employs a memory-based pipeline with image encoding, support retrieval, memory encoding (anatomical and volumetric), memory attention, and mask decoding to yield coherent masks across 3D volumes. Key contributions include the Volumetric Consistency mechanism, a retrieval-based few-shot strategy, and extensive ablations showing robustness across 11 tasks and 34 anatomical objects without fine-tuning. The approach offers practical benefits for clinical deployment by reducing data requirements and expert intervention while delivering competitive segmentation performance across modalities and anatomies, though it incurs some computational overhead and can struggle with very small structures.

Abstract

Vision foundation models have achieved remarkable progress across various image analysis tasks. In the image segmentation task, foundation models like the Segment Anything Model (SAM) enable generalizable zero-shot segmentation through user-provided prompts. However, SAM, trained primarily on natural images, lacks the domain-specific expertise of medical imaging. This limitation poses challenges when applying SAM to medical image segmentation, including the need for extensive fine-tuning on specialized medical datasets and a dependency on manual prompts, both of which are labor-intensive and require intervention from medical experts. This work introduces the Few-shot Adaptation of Training-frEe SAM (FATE-SAM), a novel method designed to adapt the advanced Segment Anything Model 2 (SAM2) for 3D medical image segmentation. FATE-SAM reassembles pre-trained modules of SAM2 to enable few-shot adaptation, leveraging a small number of support examples to capture anatomical knowledge and perform prompt-free segmentation, without requiring model fine-tuning. To handle the volumetric nature of medical images, we incorporate a Volumetric Consistency mechanism that enhances spatial coherence across 3D slices. We evaluate FATE-SAM on multiple medical imaging datasets and compare it with supervised learning methods, zero-shot SAM approaches, and fine-tuned medical SAM methods. Results show that FATE-SAM delivers robust and accurate segmentation while eliminating the need for large annotated datasets and expert intervention. FATE-SAM provides a practical, efficient solution for medical image segmentation, making it more accessible for clinical applications.

Paper Structure (19 sections, 9 equations, 5 figures, 8 tables)

Figures (5)

  • Figure 1: Applications of SAM for medical image segmentation: a) Zero-Shot Inference: Relies on manual prompts and struggles with complex anatomical structures due to the absence of medical domain-specific training. b) Medical Image Fine-Tuning: Enhances domain adaptation by fine-tuning on medical datasets. However, it demands extensive data collection and also relies on manual prompts. c) Our Method: A training-free and prompt-free adaptation approach that eliminates the need for large annotated datasets by leveraging few-shot examples, which are used as memory guidance, enabling fully automated and anatomically aware segmentation of complex structures.
  • Figure 2: The Method Pipeline: a) Image Encoding and Support Retrieval: The test slice and all slices in the support set are encoded into image embeddings, and the most similar support examples are retrieved by ranking the feature similarities between the support image embeddings and the test image embeddings. b) Memory Encoding and Volumetric Consistency: Support examples are encoded into anatomical memory embeddings, while adjacent predictions are encoded into volumetric memory embeddings. These two types of memory are then fused to create unified memory embeddings, integrating both anatomical knowledge and volumetric consistency. c) Memory Attention and Mask Decoding: The unified memory embeddings are integrated into the test image embeddings through memory attention. This enriched representation guides the segmentation process, enabling the generation of accurate and coherent predictions.
  • Figure 3: Box-plot comparison of Dice scores (%) across 8 competitive methods and our method, evaluated on 11 tasks of interest. Each point indicates the average Dice score on a task. Each box represents the interquartile range (IQR) of the Dice scores for a specific method, with the horizontal line indicating the median. The whiskers extend to the minimum and maximum values within 1.5 times the IQR.
  • Figure 4: Visual examples of segmentation results across datasets for different methods. Pink boxes enlarge the segmented objects for enhanced visualization. GT represents the ground truth masks.
  • Figure 5: Size of support set and support example ablation results on the SKI10 dataset. The top-left example illustrates the anatomical structures of the femur bone, tibia bone, femoral cartilage, and tibial cartilage. The plots display the segmentation performance of our method with varying sizes of the support set and the number of support examples. The horizontal axis indicates the number of support examples, while lines in different colors represent results for different sizes of the support set. Each sub-figure highlights the ablation performance for a specific anatomical structure.
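
The support-retrieval step described in the Figure 2 caption (ranking support slices by feature similarity to the test slice) can be illustrated with a minimal sketch. This is not the authors' implementation: the caption does not name the similarity measure, so cosine similarity is an assumption here, the embeddings are toy flat vectors rather than SAM2 image-encoder outputs, and the names `cosine_similarity` and `retrieve_support` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero embedding as dissimilar to everything
    return dot / (norm_a * norm_b)


def retrieve_support(
    test_emb: Sequence[float],
    support_embs: Sequence[Sequence[float]],
    k: int,
) -> List[Tuple[int, float]]:
    """Return (index, similarity) for the k support slices most similar to the test slice.

    Assumption: similarity is cosine similarity on flattened embeddings;
    the source only says support examples are ranked by feature similarity.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    scored = [(i, cosine_similarity(test_emb, s)) for i, s in enumerate(support_embs)]
    # Highest similarity first; ties keep their original order.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    # Toy 3-dimensional "embeddings" standing in for encoder features.
    test_slice = [1.0, 0.0, 1.0]
    support_slices = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.9], [1.0, 1.0, 1.0]]
    for index, sim in retrieve_support(test_slice, support_slices, k=2):
        print(f"support slice {index}: similarity {sim:.3f}")
```

In the pipeline the caption describes, the retrieved support slices and their masks would then be passed to the memory encoder; that stage depends on SAM2 internals and is left out of this sketch.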