Scalable Diffusion Transformer for Conditional 4D fMRI Synthesis
Jungwoo Seo, David Keetae Park, Shinjae Yoo, Jiook Cha
TL;DR
This work addresses voxelwise, whole-brain 4D task-fMRI synthesis conditioned on cognitive tasks. It introduces a latent diffusion Transformer that couples 3D VQ-GAN compression with a hierarchical CNN-Transformer backbone and strong task conditioning (AdaLN-Zero and cross-attention) to model high-dimensional spatio-temporal brain activity. Neuroscience-aligned evaluation on HCP task fMRI shows faithful task-evoked activations, preserved inter-task representational structure (RSA), and perfect condition specificity at scale, outperforming a MONAI diffusion baseline. Performance improves predictably with model scale, and the approach offers a practical pathway toward virtual experiments, cross-site harmonization, and principled data augmentation for downstream neuroimaging tasks.
Abstract
Generating whole-brain 4D fMRI sequences conditioned on cognitive tasks remains challenging for two reasons: BOLD dynamics are high-dimensional and heterogeneous across subjects and acquisitions, and neuroscience-grounded validation is lacking. We introduce the first diffusion transformer for voxelwise 4D fMRI conditional generation, combining 3D VQ-GAN latent compression with a CNN-Transformer backbone and strong task conditioning via AdaLN-Zero and cross-attention. On HCP task fMRI, our model reproduces task-evoked activation maps, preserves the inter-task representational structure observed in real data (RSA), achieves perfect condition specificity, and aligns ROI time-courses with canonical hemodynamic responses. Performance improves predictably with scale, reaching a task-evoked map correlation of 0.83 and an RSA of 0.98, and consistently surpassing a U-Net baseline on all metrics. By coupling latent diffusion with a scalable backbone and strong conditioning, this work establishes a practical path to conditional 4D fMRI synthesis, paving the way for future applications such as virtual experiments, cross-site harmonization, and principled augmentation for downstream neuroimaging models.
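To make the RSA criterion concrete, the following is a minimal sketch of one common way to compare inter-task representational structure between real and synthetic data. It is an illustration under stated assumptions, not the evaluation code used in this work: the abstract does not specify the dissimilarity measure or the comparison statistic, so correlation distance (1 - Pearson r) between task maps and a Pearson correlation between the upper triangles of the two dissimilarity matrices are assumed here. The function names (`pearson`, `rdm`, `upper_triangle`, `rsa_score`) and the toy maps are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def rdm(task_maps):
    """Representational dissimilarity matrix over tasks.

    Assumption: dissimilarity is correlation distance, 1 - Pearson r,
    between flattened activation maps (one map per task).
    """
    k = len(task_maps)
    return [
        [1.0 - pearson(task_maps[i], task_maps[j]) for j in range(k)]
        for i in range(k)
    ]


def upper_triangle(matrix):
    """Off-diagonal upper-triangle entries, row by row."""
    k = len(matrix)
    return [matrix[i][j] for i in range(k) for j in range(i + 1, k)]


def rsa_score(real_maps, synth_maps):
    """Correlate the real and synthetic RDMs (assumed: Pearson)."""
    return pearson(
        upper_triangle(rdm(real_maps)),
        upper_triangle(rdm(synth_maps)),
    )


# Toy example: 4 hypothetical tasks, 6 voxels each (not real data).
real = [
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    [2.0, 1.0, 4.0, 3.0, 6.0, 5.0],
    [6.0, 5.0, 4.0, 3.0, 2.0, 1.0],
    [1.0, 3.0, 2.0, 5.0, 4.0, 7.0],
]

# A positive rescaling and shift of each map leaves all pairwise
# correlations unchanged, so the inter-task structure is preserved.
synth_same_structure = [[2.0 * v + 1.0 for v in m] for m in real]

# Swapping two task labels changes which tasks resemble each other.
synth_swapped = [real[2], real[1], real[0], real[3]]

score_same = rsa_score(real, synth_same_structure)
score_swapped = rsa_score(real, synth_swapped)
print(score_same, score_swapped)
```

In this sketch, a score near 1 means the synthetic task maps reproduce the pattern of similarities and differences among tasks seen in the real maps, regardless of overall amplitude or offset. A lower score, as in the label-swapped case, indicates that the relationships between tasks have changed.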
