Stable Diffusion Segmentation for Biomedical Images with Single-step Reverse Process

Tianyu Lin, Zhiguang Chen, Zhonghao Yan, Weijiang Yu, Fudan Zheng

TL;DR

Diffusion-model-based medical image segmentation is powerful but computationally intensive, often requiring a multi-step reverse process and multiple samples for reliable predictions. This work introduces SDSeg, a latent diffusion segmentation model built on Stable Diffusion that uses a simple latent estimation loss to enable a single-step reverse process and a concatenation-based latent fusion strategy to avoid multiple samples, complemented by a trainable vision encoder for cross-domain adaptability. SDSeg achieves state-of-the-art performance on five datasets spanning RGB 2D and CT 3D modalities while dramatically reducing training requirements and enabling fast, stable inference. The approach offers a practical, scalable solution for automated medical image segmentation with reliable outputs.

Abstract

Diffusion models have demonstrated their effectiveness across various generative tasks. However, when applied to medical image segmentation, these models encounter several challenges, including significant resource and time requirements. They also necessitate a multi-step reverse process and multiple samples to produce reliable predictions. To address these challenges, we introduce the first latent diffusion segmentation model, named SDSeg, built upon stable diffusion (SD). SDSeg incorporates a straightforward latent estimation strategy to facilitate a single-step reverse process and utilizes latent fusion concatenation to remove the necessity for multiple samples. Extensive experiments indicate that SDSeg surpasses existing state-of-the-art methods on five benchmark datasets featuring diverse imaging modalities. Remarkably, SDSeg is capable of generating stable predictions with a solitary reverse step and sample, epitomizing the model's stability as implied by its name. The code is available at https://github.com/lin-tianyu/Stable-Diffusion-Seg
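The single-step reverse process described in the abstract can be sketched in standard DDPM notation: given the noisy latent $z_t$ and the denoiser's noise prediction, the clean latent can be estimated in one closed-form step. The sketch below is an illustrative reconstruction under that assumption, not the authors' implementation; the names (`make_alpha_bar`, `estimate_latent_single_step`) are hypothetical.

```python
import numpy as np

def make_alpha_bar(T=1000, beta_start=1e-4, beta_end=2e-2):
    """Standard DDPM cumulative noise schedule (alpha_bar_t = prod(1 - beta))."""
    betas = np.linspace(beta_start, beta_end, T)
    return np.cumprod(1.0 - betas)

def estimate_latent_single_step(z_t, eps_pred, alpha_bar_t):
    """Recover an estimate of the clean latent z0 in a single step:
        z0_hat = (z_t - sqrt(1 - alpha_bar_t) * eps_pred) / sqrt(alpha_bar_t)
    Supervising z0_hat against the ground-truth mask latent (a latent
    estimation loss) is what allows inference to skip the iterative
    multi-step reverse walk."""
    return (z_t - np.sqrt(1.0 - alpha_bar_t) * eps_pred) / np.sqrt(alpha_bar_t)

# Toy check: if the denoiser predicted the true forward-process noise
# exactly, the one-step estimate recovers z0 up to floating-point error.
rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()
t = 500
z0 = rng.standard_normal((4, 8, 8))     # stand-in for a mask latent
eps = rng.standard_normal(z0.shape)     # forward-process noise
z_t = np.sqrt(alpha_bar[t]) * z0 + np.sqrt(1.0 - alpha_bar[t]) * eps
z0_hat = estimate_latent_single_step(z_t, eps, alpha_bar[t])
assert np.allclose(z0_hat, z0)
```

In practice the noise prediction is imperfect, so the quality of the one-step estimate hinges on training the denoiser with the latent estimation loss rather than a noise-prediction objective alone.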

Paper Structure

This paper contains 21 sections, 3 equations, 6 figures, and 9 tables.

Figures (6)

  • Figure 1: The overview of SDSeg. We condition SDSeg via concatenation. In the training stage, we only train the denoising U-Net and vision encoder.
  • Figure 2: Visualization of reconstructions and latent representations on BTCV, STS, REF, and CVC. Reconstructions are denoted $\widetilde{X}=\mathcal{D}(z)$, where the latent is $z=\mathcal{E}(X)$.
  • Figure 3: Comparison of DDIM convergence speed with and without latent estimation loss on BTCV. $\lambda=1$ denotes that SDSeg is trained on latent estimation loss $\mathcal{L}_{latent}$.
  • Figure 4: Illustration of our Stability Evaluation on REF. We first conduct $M$ times of inference process to prepare for the evaluation. Then, Dataset-level Stability is evaluated on every two sets of the inference results; Instance-level Stability is estimated on every two segmentation maps of each image conditioning.
  • Figure 5: From top to bottom: visualization of the predicted probability maps in the reverse process on CVC, BTCV, and KSEG (SDSeg trained for 50,000 steps). The horizontal axis denotes DDIM sampling steps. The DDIM sampler generates fine, stable results throughout the reverse process, demonstrating that SDSeg can produce strong results within a limited number of reverse steps.
  • ...and 1 more figure