3D MedDiffusion: A 3D Medical Latent Diffusion Model for Controllable and High-quality Medical Image Generation

Haoshen Wang, Zhentao Liu, Kaicong Sun, Xiaodong Wang, Dinggang Shen, Zhiming Cui

TL;DR

3D MedDiffusion addresses high-quality 3D medical image generation with two components: a Patch-Volume Autoencoder for memory-efficient latent compression, and BiFlowNet, a dual-flow noise estimator for diffusion in latent space. Augmented with ControlNet for task conditioning, the framework achieves state-of-the-art fidelity and supports downstream tasks such as sparse-view CT reconstruction, fast MRI reconstruction, and data augmentation for segmentation and classification. Experiments across six CT/MRI datasets, ablations, and a radiologist study demonstrate superior generative quality and strong generalization, and the paper also discusses practical efficiency. The work enables controllable, high-resolution 3D medical synthesis that can be adapted to clinical pipelines, and it outlines future directions for arbitrary-size generation and conditioning factors.

Abstract

The generation of medical images presents significant challenges due to their high resolution and three-dimensional nature. Existing methods often yield suboptimal performance in generating high-quality 3D medical images, and there is currently no universal generative framework for medical imaging. In this paper, we introduce a 3D Medical Latent Diffusion (3D MedDiffusion) model for controllable, high-quality 3D medical image generation. 3D MedDiffusion incorporates a novel, highly efficient Patch-Volume Autoencoder that compresses medical images into latent space through patch-wise encoding and recovers them in image space through volume-wise decoding. Additionally, we design a new noise estimator that captures both local details and global structural information during the diffusion denoising process. 3D MedDiffusion can generate fine-detailed, high-resolution images (up to 512×512×512) and adapts effectively to various downstream tasks because it is trained on large-scale datasets covering CT and MRI modalities and different anatomical regions (from head to leg). Experimental results demonstrate that 3D MedDiffusion surpasses state-of-the-art methods in generative quality and exhibits strong generalizability across tasks such as sparse-view CT reconstruction, fast MRI reconstruction, and data augmentation for segmentation and classification. Source code and checkpoints are available at https://github.com/ShanghaiTech-IMPACT/3D-MedDiffusion.
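To give a feel for why patch-wise encoding helps with memory, the following is a minimal sketch of the bookkeeping involved in splitting a 512×512×512 volume into non-overlapping patches. The patch size of 64 and the downsampling factor of 4 are illustrative assumptions, not values stated in the abstract, and the helper names `patch_origins` and `latent_shape` are hypothetical.

```python
from itertools import product


def patch_origins(volume_shape, patch_size):
    """Return the corner index of every non-overlapping cubic patch."""
    for dim in volume_shape:
        if dim % patch_size != 0:
            raise ValueError("each volume dimension must be divisible by patch_size")
    axes = [range(0, dim, patch_size) for dim in volume_shape]
    return list(product(*axes))


def latent_shape(volume_shape, downsample):
    """Spatial shape of the latent grid for a given downsampling factor."""
    return tuple(dim // downsample for dim in volume_shape)


# Assumed values for illustration only (not taken from the paper).
VOLUME = (512, 512, 512)
PATCH_SIZE = 64
DOWNSAMPLE = 4

origins = patch_origins(VOLUME, PATCH_SIZE)
print(len(origins))                      # number of patches encoded one at a time
print(latent_shape(VOLUME, DOWNSAMPLE))  # latent grid the diffusion model would see
```

Under these assumed settings the encoder only ever processes one 64-voxel cube at a time, while the latents of all patches are assembled into a single grid that a decoder can process as a whole volume.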

Paper Structure

This paper contains 33 sections, 11 equations, 13 figures, 11 tables.

Figures (13)

  • Figure 1: Patch-Volume Autoencoder with a two-stage training strategy. In the first stage, the model is trained solely to compress and reconstruct small patches from high-resolution volumes. In the second stage, all parameters are fixed except for the decoder, which is fine-tuned on high-resolution volumes to become a joint decoder.
  • Figure 2: BiFlowNet noise estimator. The intra-patch flow focuses on denoising each patch and recovering fine-grained local details, while the inter-patch flow is designed to capture and reconstruct the global structures across the entire volume.
  • Figure 3: The network architecture of ControlNet.
  • Figure 4: Qualitative results on CTChestAbdomen. Window: [-1000,1000] HU.
  • Figure 5: Qualitative results on MRBrain.
  • ...and 8 more figures
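The two-stage training strategy described in the Figure 1 caption can be sketched as a simple freeze schedule. This is a toy illustration in plain Python, not the authors' training code: the `ParamGroup` class, the `configure_stage` helper, and the two group names are hypothetical stand-ins for real network modules.

```python
from dataclasses import dataclass


@dataclass
class ParamGroup:
    """Hypothetical stand-in for a block of network parameters."""
    name: str
    trainable: bool = True


def configure_stage(groups, stage):
    """Set trainable flags for a training stage and return trainable names.

    Stage 1: every group is trained (patch-level compression and reconstruction).
    Stage 2: everything is frozen except the decoder (volume-level fine-tuning).
    """
    if stage not in (1, 2):
        raise ValueError("stage must be 1 or 2")
    for group in groups:
        group.trainable = True if stage == 1 else group.name == "decoder"
    return [group.name for group in groups if group.trainable]


groups = [ParamGroup("encoder"), ParamGroup("decoder")]
stage1 = configure_stage(groups, stage=1)
stage2 = configure_stage(groups, stage=2)
print(stage1)  # groups updated while training on small patches
print(stage2)  # groups updated while fine-tuning on full volumes
```

In a real deep learning framework the same idea would typically be expressed by disabling gradient updates for the frozen modules before the second stage begins.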