SD-FSMIS: Adapting Stable Diffusion for Few-Shot Medical Image Segmentation

Meihua Li, Yang Zhang, Weizhao He, Hu Qu, Yisong Li

Abstract

Few-Shot Medical Image Segmentation (FSMIS) aims to segment novel object classes in medical images using only minimal annotated examples, addressing the critical challenges of data scarcity and domain shifts prevalent in medical imaging. While Diffusion Models (DMs) excel in visual tasks, their potential for FSMIS remains largely unexplored. We propose that the rich visual priors learned by large-scale DMs offer a powerful foundation for a more robust and data-efficient segmentation approach. In this paper, we introduce SD-FSMIS, a novel framework designed to effectively adapt the powerful pre-trained Stable Diffusion (SD) model for the FSMIS task. Our approach repurposes its conditional generative architecture by introducing two key components: a Support-Query Interaction (SQI) module and a Visual-to-Textual Condition Translator (VTCT). Specifically, SQI provides a straightforward yet powerful means of adapting SD to the FSMIS paradigm. The VTCT module translates visual cues from the support set into an implicit textual embedding that guides the diffusion model, enabling precise conditioning of the generation process. Extensive experiments demonstrate that SD-FSMIS achieves competitive results compared to state-of-the-art methods in standard settings. Surprisingly, it also demonstrates excellent generalization in more challenging cross-domain scenarios. These findings highlight the immense potential of adapting large-scale generative models to advance data-efficient and robust medical image segmentation.
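
To make the conditioning idea concrete, the snippet below is a minimal, hypothetical sketch of what a VTCT-style module could look like. It is not the authors' implementation: the class name, the pooling-and-projection design, and the output shape (77 tokens of width 768, matching the CLIP text embeddings consumed by Stable Diffusion v1.x cross-attention) are all assumptions.

```python
import torch
import torch.nn as nn

class VisualToTextualTranslator(nn.Module):
    """Hypothetical VTCT sketch: maps a support latent to a pseudo
    text-embedding sequence E with the shape SD expects from its CLIP
    text encoder (77 tokens x 768 dims for SD v1.x), so E can drive
    the U-Net's cross-attention like an ordinary text condition."""
    def __init__(self, in_channels: int = 4, num_tokens: int = 77, dim: int = 768):
        super().__init__()
        self.num_tokens, self.dim = num_tokens, dim
        self.pool = nn.AdaptiveAvgPool2d(1)               # global visual cue
        self.to_tokens = nn.Linear(in_channels, num_tokens * dim)

    def forward(self, support_latent: torch.Tensor) -> torch.Tensor:
        b = support_latent.size(0)
        cue = self.pool(support_latent).flatten(1)        # (B, C)
        tokens = self.to_tokens(cue)                      # (B, 77 * 768)
        return tokens.view(b, self.num_tokens, self.dim)  # (B, 77, 768)

# Shape check: a 4-channel 32x32 support latent -> a (B, 77, 768) condition E
vtct = VisualToTextualTranslator()
E = vtct(torch.randn(2, 4, 32, 32))
assert E.shape == (2, 77, 768)
```

Because $E$ has the same shape as a text-encoder output, an embedding like this can be fed to the U-Net wherever the text condition normally goes (e.g., the `encoder_hidden_states` argument in diffusers), which is what makes the "implicit textual embedding" framing practical.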

Figures (5)

  • Figure 1: Comparison between our proposed method and previous methods. (a) Previous fully supervised methods build task-specific networks from scratch and require pixel-level annotations. They generate class prototypes from the limited support set and perform segmentation via feature matching. Because they lack strong priors and are trained on constrained data, these models are often brittle and struggle with complex visual variations. (b) Instead of building a new network, we adapt a powerful pre-trained foundation model and do not rely on manual annotations. Our framework steers its vast, generalizable visual priors, achieving superior robustness and generalization, especially in challenging cross-domain scenarios.
  • Figure 2: SD-FSMIS Overview and Training Pipeline. Support and query sets are first encoded by the VAE encoder $\mathcal{E}$. The query latent $z^{qi}$ is enhanced via the Query Enhancement module to obtain $z^{q}$, while the support latent $z^{si}$ and its mask latent $z^{sm}$ are concatenated along the channel dimension to form $z^{s}$. These are then fed into the U-Net, which generates the query mask latent $\hat{z}^{qm}$ under the condition of the text embedding $E$ produced by the Visual-to-Textual Condition Translator module (a code sketch of this pipeline follows the figure list).
  • Figure 3: Architecture of the modified BasicTransformerBlocks.
  • Figure 4: Overview of the SD-FSMIS inference process.
  • Figure 5: Qualitative comparison between our method and the DiffewS method on the Abd-MRI and Abd-CT datasets.
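
As referenced in the Figure 2 caption, the following is a minimal sketch of one training step's data flow. The frozen SD VAE encoder and conditional U-Net are replaced by random-output placeholders, the Query Enhancement step is elided, and all shapes, plus the way $z^{q}$ and $z^{s}$ enter the U-Net, are assumptions; only the wiring described in the caption (channel-wise concatenation of support image and mask latents, a VTCT-produced condition $E$, U-Net prediction of the query mask latent) is taken from the paper.

```python
import torch
import torch.nn as nn

# Placeholder stand-ins for the frozen Stable Diffusion components. Shapes are
# assumptions (3x256x256 images -> 4x32x32 latents, matching SD's 8x VAE).
def vae_encode(x: torch.Tensor) -> torch.Tensor:   # plays the role of E(.)
    return torch.randn(x.size(0), 4, 32, 32)

def unet(z: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
    return torch.randn(z.size(0), 4, 32, 32)       # predicted mask latent

query_img    = torch.randn(1, 3, 256, 256)
support_img  = torch.randn(1, 3, 256, 256)
support_mask = torch.randn(1, 3, 256, 256)  # binary mask rendered as an image

z_qi = vae_encode(query_img)            # query latent z^{qi}
z_q  = z_qi                             # Query Enhancement step elided here
z_si = vae_encode(support_img)          # support latent z^{si}
z_sm = vae_encode(support_mask)         # support mask latent z^{sm}
z_s  = torch.cat([z_si, z_sm], dim=1)   # channel-wise concat -> z^{s}, (1, 8, 32, 32)

# Minimal VTCT stand-in: pool z^{s} to a global cue and project it to a
# (77, 768) pseudo text-embedding sequence E (dimensions assumed from SD v1.x).
vtct = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 77 * 768))
E = vtct(z_s).view(z_s.size(0), 77, 768)

# How z^{q} and z^{s} enter the U-Net is not specified by the caption; plain
# channel-wise concatenation is used here purely as an assumption.
z_qm_hat = unet(torch.cat([z_q, z_s], dim=1), E)   # predicted \hat{z}^{qm}
```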