Scalable Diffusion Transformer for Conditional 4D fMRI Synthesis

Jungwoo Seo, David Keetae Park, Shinjae Yoo, Jiook Cha

TL;DR

This work tackles the challenge of voxelwise, whole-brain 4D task-fMRI synthesis under conditioning on cognitive tasks. It introduces a latent diffusion Transformer that couples 3D VQ-GAN compression with a hierarchical CNN–Transformer backbone and robust conditioning (AdaLN-Zero and cross-attention) to model high-dimensional spatio-temporal brain activity. Neuroscience-aligned evaluation on HCP task fMRI demonstrates faithful task-evoked activations, preserved inter-task representations (RSA), and perfect condition specificity at scale, outperforming a MONAI diffusion baseline. The findings reveal predictable scaling laws and offer a practical pathway for virtual experiments, cross-site harmonization, and principled data augmentation in downstream neuroimaging tasks.

Abstract

Generating whole-brain 4D fMRI sequences conditioned on cognitive tasks remains challenging due to the high-dimensional, heterogeneous BOLD dynamics across subjects/acquisitions and the lack of neuroscience-grounded validation. We introduce the first diffusion transformer for voxelwise 4D fMRI conditional generation, combining 3D VQ-GAN latent compression with a CNN-Transformer backbone and strong task conditioning via AdaLN-Zero and cross-attention. On HCP task fMRI, our model reproduces task-evoked activation maps, preserves the inter-task representational structure observed in real data (RSA), achieves perfect condition specificity, and aligns ROI time-courses with canonical hemodynamic responses. Performance improves predictably with scale, reaching task-evoked map correlation of 0.83 and RSA of 0.98, consistently surpassing a U-Net baseline on all metrics. By coupling latent diffusion with a scalable backbone and strong conditioning, this work establishes a practical path to conditional 4D fMRI synthesis, paving the way for future applications such as virtual experiments, cross-site harmonization, and principled augmentation for downstream neuroimaging models.
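The RSA score reported above compares the inter-task representational structure of real and generated data. The excerpt does not spell out the exact procedure, so the following is only a minimal sketch of one common formulation: build a correlation-distance representational dissimilarity matrix (RDM) over per-task activation maps, then take the Pearson correlation between the upper triangles of the real and synthetic RDMs. The function names `rdm` and `rsa_score`, the array shapes, and the random toy data are all illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def rdm(maps: np.ndarray) -> np.ndarray:
    """Correlation-distance RDM for maps of shape (n_conditions, n_voxels)."""
    return 1.0 - np.corrcoef(maps)


def rsa_score(real_maps: np.ndarray, synth_maps: np.ndarray) -> float:
    """Pearson correlation between the upper triangles of two RDMs."""
    real_rdm = rdm(real_maps)
    synth_rdm = rdm(synth_maps)
    # Off-diagonal upper triangle only: the diagonal is trivially zero.
    iu = np.triu_indices(real_rdm.shape[0], k=1)
    return float(np.corrcoef(real_rdm[iu], synth_rdm[iu])[0, 1])


# Toy data (assumption): 7 hypothetical conditions, 500 voxels each.
rng = np.random.default_rng(0)
real = rng.standard_normal((7, 500))
# "Synthetic" maps here are just the real maps plus small noise.
synth = real + 0.1 * rng.standard_normal((7, 500))

score = rsa_score(real, synth)
self_score = rsa_score(real, real)
```

A score near 1 indicates that conditions which are similar (or dissimilar) in real data are also similar (or dissimilar) in generated data, independent of how well any single map is reproduced.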

Paper Structure

This paper contains 31 sections, 1 equation, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Overview and sampled visual results compared against MONAI pinaya2023generative. (a) Our architecture follows recent advances in VQGAN kim2024adaptive and Latent Diffusion rombach2022high, equipped with joint conditioning for strong conditional generative performance. (b) Compared to the closest baseline, MONAI, our model generates superior spatial details and (c) conditional temporal BOLD dynamics, as revealed by a group-level GLM activation map (Section \ref{sec:eval_metrics}).
  • Figure 2: Generative performance as a function of model capacity. Performance is evaluated using three neuroscience-aligned metrics: (a) GLM activation map correlation, (b) Representational similarity analysis (RSA, Pearson), and (c) Condition specificity (Top-1 Accuracy).
  • Figure 3: Principal component analysis (PCA) of HCP task-fMRI volumes. (Left) Clustering by subject shows that individual differences dominate the data variance. (Middle) Phase encoding direction (LR vs. RL) explains the second-largest variability component. (Right) Task conditions contribute comparatively little variance. These results indicate that subject- and acquisition-related factors obscure task-evoked signals, consistent with prior findings.
  • Figure 4: Detailed architecture of the proposed CNN–Transformer hybrid backbone. (a) UNet-style hierarchy integrating convolutional (yellow) and transformer (green) stages with downsampling/upsampling paths. (b) Residual block with FiLM-based conditioning. (c) Transformer block with AdaLN-Zero and cross-attention conditioning. This design balances computational efficiency, inductive bias, and scalability.
  • Figure 5: Additional visual results compared against MONAI pinaya2023generative.
  • ...and 1 more figure
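The AdaLN-Zero conditioning named in Figure 4(c) can be illustrated with a small NumPy sketch. It follows the general idea of adaptive layer norm with zero-initialised modulation: a conditioning vector is projected to a shift, a scale, and a gate, and because the projection starts at zero the block initially acts as the identity. The class `AdaLNZeroBlock`, the single linear stand-in for the attention/MLP sublayer, and all dimensions are assumptions for illustration; the paper's actual block (including cross-attention) is not reproduced here.

```python
import numpy as np


def layer_norm(x: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Layer norm over the last axis, without learned affine parameters."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


class AdaLNZeroBlock:
    """Toy residual block with AdaLN-Zero style modulation (illustrative)."""

    def __init__(self, dim: int, cond_dim: int, rng: np.random.Generator):
        # Zero-initialised projection: shift, scale and gate all start at 0.
        self.w_mod = np.zeros((cond_dim, 3 * dim))
        self.b_mod = np.zeros(3 * dim)
        # Hypothetical stand-in for the attention or MLP sublayer.
        self.w_sub = rng.standard_normal((dim, dim)) / np.sqrt(dim)

    def __call__(self, x: np.ndarray, cond: np.ndarray) -> np.ndarray:
        # x: (n_tokens, dim); cond: (cond_dim,), e.g. a task embedding.
        shift, scale, gate = np.split(cond @ self.w_mod + self.b_mod, 3)
        h = layer_norm(x) * (1.0 + scale) + shift  # conditioned normalisation
        return x + gate * (h @ self.w_sub)         # gated residual update


rng = np.random.default_rng(0)
block = AdaLNZeroBlock(dim=8, cond_dim=4, rng=rng)
tokens = rng.standard_normal((5, 8))
task_embedding = rng.standard_normal(4)

out_at_init = block(tokens, task_embedding)

# Simulate training having moved the modulation weights away from zero.
block.w_mod = 0.1 * rng.standard_normal((4, 24))
out_after = block(tokens, task_embedding)
```

With the zero initialisation the gate is exactly zero, so `out_at_init` equals `tokens`; once the modulation weights become non-zero, the conditioning vector changes the block's output.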