VolDiT: Controllable Volumetric Medical Image Synthesis with Diffusion Transformers

Marvin Seyfarth, Salman Ul Hassan Dar, Yannik Frisch, Philipp Wild, Norbert Frey, Florian André, Sandy Engelhardt

Abstract

Diffusion models have become a leading approach for high-fidelity medical image synthesis. However, most existing methods for 3D medical image generation rely on convolutional U-Net backbones within latent diffusion frameworks. While effective, these architectures impose strong locality biases and limited receptive fields, which may constrain scalability, global context integration, and flexible conditioning. In this work, we introduce VolDiT, the first purely transformer-based 3D Diffusion Transformer for volumetric medical image synthesis. Our approach extends diffusion transformers to native 3D data through volumetric patch embeddings and global self-attention operating directly over 3D tokens. To enable structured control, we propose a timestep-gated control adapter that maps segmentation masks into learnable control tokens that modulate transformer layers during denoising. This token-level conditioning mechanism allows precise spatial guidance while preserving the modeling advantages of transformer architectures. We evaluate our model on high-resolution 3D medical image synthesis tasks and compare it to state-of-the-art 3D latent diffusion models based on U-Nets. Results demonstrate improved global coherence, superior generative fidelity, and enhanced controllability. Our findings suggest that fully transformer-based diffusion models provide a flexible foundation for volumetric medical image synthesis. The code and models trained on public data are available at https://github.com/Cardio-AI/voldit.
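The volumetric patch embedding described above can be illustrated with a minimal sketch: a 3D volume is split into non-overlapping p×p×p cubes, and each cube is flattened into one token vector before a linear projection. The function name, patch size, and shapes below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def patchify_3d(volume, patch=4):
    """Split a (D, H, W) volume into non-overlapping patch^3 cubes,
    flattening each cube into one token (illustrative sketch only).

    Returns an array of shape (num_patches, patch**3); in a real model
    a learned linear layer would then project each row to the token width.
    """
    D, H, W = volume.shape
    assert D % patch == 0 and H % patch == 0 and W % patch == 0
    p = patch
    # Reshape to (D/p, p, H/p, p, W/p, p), group patch axes together,
    # then flatten each p*p*p cube into a single row.
    tokens = (volume
              .reshape(D // p, p, H // p, p, W // p, p)
              .transpose(0, 2, 4, 1, 3, 5)
              .reshape(-1, p * p * p))
    return tokens

vol = np.arange(8 * 8 * 8, dtype=np.float32).reshape(8, 8, 8)
tok = patchify_3d(vol, patch=4)
print(tok.shape)  # (8, 64): 2*2*2 patches, each covering 4^3 voxels
```

The inverse "unpatchify" step mentioned in Figure 1 is the same sequence of reshape/transpose operations run in reverse.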

Paper Structure

This paper contains 13 sections, 3 figures, and 3 tables.

Figures (3)

  • Figure 1: Overview of the VolDiT framework. Input volumes are encoded into latents, patchified into tokens, and processed by the 3D Diffusion Transformer for denoising, with spatial conditioning injected through the timestep-gated control adapter. The transformer outputs are then unpatchified and decoded to reconstruct volumetric images.
  • Figure 2: Synthetic examples on LUNA16 and TaviCT. HA-GAN produces anatomically implausible structures, while the U-Net LDM yields high-quality images but struggles with global anatomical consistency (red circles). VolDiT generates anatomically coherent volumes with sharper structural detail, reflecting the benefit of global self-attention.
  • Figure 3: Conditionally generated TaviCT samples based on held-out test masks of heart structures (blue) and the aorta (green). The U-Net-based model shows less alignment with input masks and does not preserve anatomical realism when enforcing the condition (highlighted by red squares).
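The timestep-gated control adapter from the abstract can be sketched as follows: mask-derived control tokens are added to the image tokens, scaled by a gate predicted from the diffusion timestep, so the strength of spatial conditioning can vary across the denoising trajectory. All names, shapes, and the sigmoid gate below are assumptions for illustration; the paper's exact adapter design may differ.

```python
import numpy as np

def timestep_embedding(t, dim=16, max_period=1000.0):
    """Standard sinusoidal embedding of a scalar diffusion timestep."""
    half = dim // 2
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    ang = t * freqs
    return np.concatenate([np.cos(ang), np.sin(ang)])

def gated_control(tokens, control_tokens, t, W_gate, b_gate):
    """Add segmentation-derived control tokens to the image tokens,
    scaled by a timestep-dependent gate in (0, 1) (illustrative sketch)."""
    emb = timestep_embedding(t, dim=W_gate.shape[0])
    gate = 1.0 / (1.0 + np.exp(-(emb @ W_gate + b_gate)))  # sigmoid gate
    return tokens + gate * control_tokens

rng = np.random.default_rng(0)
tokens = rng.standard_normal((8, 32))    # 8 image tokens of width 32
control = rng.standard_normal((8, 32))   # tokens encoding a segmentation mask
W = rng.standard_normal((16, 1)) * 0.1   # toy learnable gate parameters
b = np.zeros(1)
out = gated_control(tokens, control, t=500, W_gate=W, b_gate=b)
print(out.shape)  # (8, 32)
```

Because the gate is bounded in (0, 1), the adapter can never perturb a token by more than the magnitude of its control token, which keeps the conditioning signal from overwhelming the denoiser at any timestep.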