DINO-BOLDNet: A DINOv3-Guided Multi-Slice Attention Network for T1-to-BOLD Generation

Jianwei Wang, Qing Wang, Menglan Ruan, Rongjun Ge, Chunfeng Yang, Yang Chen, Chunming Xie

TL;DR

This work addresses the recovery of functional information by generating mean BOLD images directly from structural T1-weighted scans. It introduces DINO-BOLDNet, a DINOv3-guided architecture that combines a frozen transformer-based structural encoder with a trainable multi-slice decoder and slice-wise attention to fuse cross-slice context, augmented by DINO-based perceptual supervision. The model uses a four-module framework and a composite loss that targets voxel-wise fidelity and structural similarity, and it outperforms a conditional GAN baseline in both PSNR and MS-SSIM on a 248-subject clinical dataset. The approach demonstrates the feasibility of structural-to-functional mapping using self-supervised transformer guidance, potentially enabling functional preprocessing when BOLD data are missing or corrupted.
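PSNR, one of the two reported metrics, has a standard closed form: 10·log10(R² / MSE), where R is the intensity range. The following is a minimal sketch of that formula, assuming images are flattened to equal-length sequences and normalized to [0, 1]; the function name `psnr` and the `data_range` parameter are illustrative and do not come from the paper, whose exact evaluation protocol is not described in this summary.

```python
import math


def psnr(reference, prediction, data_range=1.0):
    """Peak signal-to-noise ratio (dB) between two flattened images.

    `data_range` is the assumed intensity span (1.0 for [0, 1] data).
    """
    if len(reference) != len(prediction) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all voxels.
    mse = sum((r - p) ** 2 for r, p in zip(reference, prediction)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(data_range ** 2 / mse)


if __name__ == "__main__":
    ground_truth = [0.0, 1.0, 0.5, 0.25]
    generated = [0.1, 0.9, 0.5, 0.25]
    print(f"PSNR: {psnr(ground_truth, generated):.2f} dB")
```

Higher values indicate lower voxel-wise error. MS-SSIM, the second metric, compares local structure at several scales and is not sketched here.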

Abstract

Generating BOLD images from T1w images offers a promising solution for recovering missing BOLD information and enabling downstream tasks when BOLD images are corrupted or unavailable. Motivated by this, we propose DINO-BOLDNet, a DINOv3-guided multi-slice attention framework that integrates a frozen self-supervised DINOv3 encoder with a lightweight trainable decoder. The model uses DINOv3 to extract within-slice structural representations, and a separate slice-attention module to fuse contextual information across neighboring slices. A multi-scale generation decoder then restores fine-grained functional contrast, while a DINO-based perceptual loss encourages structural and textural consistency between predictions and ground-truth BOLD in the transformer feature space. Experiments on a clinical dataset of 248 subjects show that DINO-BOLDNet surpasses a conditional GAN baseline in both PSNR and MS-SSIM. To our knowledge, this is the first framework capable of generating mean BOLD images directly from T1w images, highlighting the potential of self-supervised transformer guidance for structural-to-functional mapping.
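The idea of fusing context across neighboring slices can be illustrated with plain scaled dot-product attention. The sketch below is an assumption-laden illustration, not the paper's slice-attention module: it treats each slice as a single feature vector, uses the center slice as the query, and omits the learned projections a real module would likely have. The names `softmax` and `fuse_slices` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_slices(slice_features, center_index):
    """Fuse per-slice feature vectors into one vector for the center slice.

    Assumption: the center slice acts as the query and every slice
    (including the center) acts as both key and value.
    """
    query = slice_features[center_index]
    scale = math.sqrt(len(query))
    # Similarity of the center slice to each slice in the stack.
    scores = [
        sum(q * k for q, k in zip(query, key)) / scale
        for key in slice_features
    ]
    weights = softmax(scores)
    # Attention-weighted sum of the slice features.
    fused = [
        sum(w * feat[d] for w, feat in zip(weights, slice_features))
        for d in range(len(query))
    ]
    return fused, weights


if __name__ == "__main__":
    # Three neighboring slices, each summarized by a 2-D feature vector.
    stack = [[0.0, 0.0], [2.0, 0.0], [1.0, 1.0]]
    fused, weights = fuse_slices(stack, center_index=1)
    print("weights:", [round(w, 3) for w in weights])
    print("fused:", [round(v, 3) for v in fused])
```

Slices whose features resemble the center slice receive larger weights, so the fused vector draws more on structurally similar neighbors. In the described framework, the per-slice features would come from the frozen DINOv3 encoder rather than hand-written vectors.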

Paper Structure

This paper contains 16 sections, 5 equations, 2 figures, 3 tables.

Figures (2)

  • Figure 1: Overview of the proposed DINOv3-guided T1-to-BOLD generation framework. (a) DBSE: DINO-based structural encoder; (b) SWAFM: slice-wise attention fusion module; (c) MSGD: multi-scale generation decoder; (d) DBPSM: DINO-based perceptual supervision module.
  • Figure 2: Visual comparison of T1-to-BOLD generation. Top: input T1, cGAN output, DINO-BOLDNet output, and ground-truth BOLD. Bottom: error maps (GT - Pred), where red/blue denote positive/negative residuals. DINO-BOLDNet shows clearer anatomical detail and lower generation error.