MedCondDiff: Lightweight, Robust, Semantically Guided Diffusion for Medical Image Segmentation

Ruirui Huang, Jiacheng Li

TL;DR

The paper tackles multi-organ medical image segmentation across modalities by introducing MedCondDiff, a diffusion-based approach conditioned on semantic priors. It integrates a lightweight adapter that injects hierarchical priors from a Pyramid Vision Transformer into the denoising network, producing anatomically faithful masks with reduced memory and faster inference. Key contributions include a unified adapter framework for diffusion conditioning, a PVT-based conditioning backbone, and empirical validation across abdominal CT and brain MRI datasets showing efficiency with competitive accuracy. The work demonstrates the practicality of semantically guided diffusion for medical imaging, particularly in resource-constrained settings and diverse modalities.

Abstract

We introduce MedCondDiff, a diffusion-based framework for multi-organ medical image segmentation that is efficient and anatomically grounded. The model conditions the denoising process on semantic priors extracted by a Pyramid Vision Transformer (PVT) backbone, yielding a semantically guided and lightweight diffusion architecture. This design improves robustness while reducing both inference time and VRAM usage compared to conventional diffusion models. Experiments on multi-organ, multi-modality datasets demonstrate that MedCondDiff delivers competitive performance across anatomical regions and imaging modalities, underscoring the potential of semantically guided diffusion models as an effective class of architectures for medical imaging tasks.
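The conditioning idea described above can be illustrated with a small NumPy sketch. It is not the authors' implementation: the average-pooling "backbone" is a stand-in for a real PVT, the additive injection rule is an assumption (the abstract does not say how priors are fused), and every function name, the strides (4, 8, 16, 32), and the choice of a binary mask as the noised target are hypothetical choices made for illustration. Only the forward noising step follows the standard DDPM formula.

```python
import numpy as np

def avg_pool(x, factor):
    """Downsample an (H, W) array by averaging non-overlapping factor x factor blocks."""
    h, w = x.shape
    return x.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

def hierarchical_priors(image, factors=(4, 8, 16, 32)):
    """Hypothetical stand-in for a pyramid backbone: one map per scale, coarser as factor grows."""
    return [avg_pool(image, f) for f in factors]

def noise_target(x0, alpha_bar_t, rng):
    """Standard DDPM forward step: x_t = sqrt(a) * x0 + sqrt(1 - a) * eps."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps
    return x_t, eps

def inject(denoiser_feats, priors):
    """Assumed fusion rule: add each prior to the denoiser feature map of the same resolution."""
    return [f + p for f, p in zip(denoiser_feats, priors)]

rng = np.random.default_rng(0)
image = rng.random((64, 64))                 # toy input scan
mask = (image > 0.5).astype(np.float64)      # toy target mask (assumption)
x_t, eps = noise_target(mask, alpha_bar_t=0.5, rng=rng)

priors = hierarchical_priors(image)          # semantic priors from the conditioning branch
denoiser_feats = hierarchical_priors(x_t)    # placeholder for the denoiser's encoder features
fused = inject(denoiser_feats, priors)
print([f.shape for f in fused])
```

With a 64×64 input, the four fused maps have shapes 16×16, 8×8, 4×4, and 2×2, one per pyramid level. In a real model the denoising network would consume these fused features to predict the noise `eps` at each timestep.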

Paper Structure

This paper contains 13 sections, 10 equations, 4 figures, 3 tables, 2 algorithms.

Figures (4)

  • Figure 1: MedCondDiff training framework. The conditional network (‘Adapter’) extracts features and injects them into the denoising network to enhance mask prediction.
  • Figure 2: MedCondDiff conditioning framework. The ‘Adapter’ extracts features and injects them into the denoising network to enhance mask prediction. In the first layer of the conditional network, the block labeled $*$ processes the noised image $\mathbf{x_t}$ and combines it with the regular embedding.
  • Figure 3: Qualitative comparisons. MedCondDiff yields more accurate predictions with finer details and contours (blue: false negative; orange: false positive; red: hallucination).
  • Figure : Training (PVT-conditioned)