DINO-BOLDNet: A DINOv3-Guided Multi-Slice Attention Network for T1-to-BOLD Generation
Jianwei Wang, Qing Wang, Menglan Ruan, Rongjun Ge, Chunfeng Yang, Yang Chen, Chunming Xie
TL;DR
This work addresses the recovery of functional information by generating mean BOLD images directly from structural T1-weighted scans. It introduces DINO-BOLDNet, a DINOv3-guided architecture that pairs a frozen transformer-based structural encoder with a trainable multi-slice decoder, uses slice-wise attention to fuse cross-slice context, and adds DINO-based perceptual supervision. The model is organized as a four-module framework trained with a composite loss that targets voxel-wise fidelity and structural similarity. On a 248-subject clinical dataset, it outperforms a conditional GAN baseline in both PSNR and MS-SSIM. The results suggest that structural-to-functional mapping with self-supervised transformer guidance is feasible, which could support functional preprocessing when BOLD data are missing or corrupted.
Abstract
Generating blood-oxygen-level-dependent (BOLD) images from T1-weighted (T1w) images offers a way to recover missing BOLD information and to enable downstream tasks when BOLD images are corrupted or unavailable. Motivated by this, we propose DINO-BOLDNet, a DINOv3-guided multi-slice attention framework that integrates a frozen self-supervised DINOv3 encoder with a lightweight trainable decoder. The model uses DINOv3 to extract within-slice structural representations and a separate slice-attention module to fuse contextual information across neighboring slices. A multi-scale generation decoder then restores fine-grained functional contrast, while a DINO-based perceptual loss encourages structural and textural consistency between predictions and ground-truth BOLD in the transformer feature space. Experiments on a clinical dataset of 248 subjects show that DINO-BOLDNet surpasses a conditional GAN baseline in both PSNR and MS-SSIM. To our knowledge, this is the first framework to generate mean BOLD images directly from T1w images, highlighting the potential of self-supervised transformer guidance for structural-to-functional mapping.
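To make the idea of fusing context across neighboring slices concrete, the following is a minimal NumPy sketch of one possible form of slice-wise attention. It is not the authors' module: the function name `slice_attention`, the tensor layout (slices, tokens, channels), the use of the center slice as the query, and the absence of learned projections are all assumptions made for illustration. The abstract does not specify these details.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def slice_attention(feats, center=None):
    """Fuse per-slice token features into one slice (hypothetical sketch).

    feats:  array of shape (S, N, D) holding features for S neighboring
            slices, N tokens per slice, and D channels per token.
    center: index of the target slice; defaults to the middle slice.

    Returns (fused, weights) with shapes (N, D) and (N, S).
    """
    S, N, D = feats.shape
    if center is None:
        center = S // 2
    q = feats[center]  # queries come from the target slice, shape (N, D)

    # Scaled dot product between each target token and the token at the
    # same position in every slice: shape (N, S).
    scores = np.einsum("nd,snd->ns", q, feats) / np.sqrt(D)
    weights = softmax(scores, axis=-1)

    # Weighted sum over the slice axis: shape (N, D).
    fused = np.einsum("ns,snd->nd", weights, feats)
    return fused, weights


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.standard_normal((3, 16, 8))  # 3 slices, 16 tokens, 8 channels
    fused, weights = slice_attention(feats)
    print(fused.shape, weights.shape)
```

In this sketch each token of the target slice attends only to the same spatial position in its neighbors, which keeps the cost linear in the number of slices. A trainable version would typically add learned query, key, and value projections, which are omitted here.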
