NeuralDiffuser: Neuroscience-inspired Diffusion Guidance for fMRI Visual Reconstruction

Haoyu Li, Hao Wu, Badong Chen

TL;DR

NeuralDiffuser tackles the difficulty of reconstructing detailed visual stimuli from fMRI by introducing neuroscience-inspired diffusion guidance that injects bottom-up visual cues into a latent diffusion model. The method maps fMRI into a subject-shared space, aligns it with both high-level semantic embeddings and low-level latents, and applies primary visual feature guidance using multilayer CLIP image features decoded from fMRI with a two-stage training scheme. A novel guidance strategy with two hyperparameters, $\kappa$ (guidance scale) and $\eta$ (guided-step proportion), enhances detail fidelity while preserving semantic coherence and consistency across repeated reconstructions. Experiments on the NSD dataset show more detailed reconstructions and better retrieval performance than state-of-the-art baselines, with ablations confirming the effectiveness and robustness of the guidance approach. This work advances brain decoding by integrating bottom-up visual cues into diffusion-based reconstruction, offering a framework that can extend to other reconstruction tasks and diffusion-guided modalities.
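To make the roles of $\kappa$ and $\eta$ concrete, the following is a minimal toy sketch of gradient-based guidance, not the authors' implementation. Everything in it is an assumption made for illustration: the linear map `W` stands in for the multilayer CLIP image feature extractor, `target` stands in for the features decoded from fMRI, the optional `denoise` callback stands in for a reverse diffusion step, and the choice to guide the earliest `eta` fraction of steps is hypothetical.

```python
import numpy as np


def guided_step_count(num_steps: int, eta: float) -> int:
    """Number of sampling steps that receive guidance.

    Assumption: eta is the proportion of guided steps, and the guided
    steps are the earliest ones. The summary does not specify which.
    """
    return int(round(eta * num_steps))


def feature_loss_grad(x: np.ndarray, target: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Gradient of 0.5 * ||W @ x - target||^2 with respect to x.

    W is a hypothetical linear stand-in for a feature extractor.
    """
    return W.T @ (W @ x - target)


def guided_sampling(x, target, W, num_steps=50, kappa=0.1, eta=0.5, denoise=None):
    """Toy sampling loop: optional denoising plus feature-gradient guidance."""
    n_guided = guided_step_count(num_steps, eta)
    x = np.array(x, dtype=float)  # copy so the caller's array is not modified
    for step in range(num_steps):
        if denoise is not None:
            x = denoise(x, step)  # placeholder for one reverse diffusion step
        if step < n_guided:
            # kappa scales how strongly the feature gradient pulls the sample
            x = x - kappa * feature_loss_grad(x, target, W)
    return x


if __name__ == "__main__":
    W = np.eye(4)
    target = np.ones(4)
    x0 = np.zeros(4)
    guided = guided_sampling(x0, target, W, num_steps=10, kappa=0.5, eta=1.0)
    unguided = guided_sampling(x0, target, W, num_steps=10, kappa=0.5, eta=0.0)
    print("feature error with guidance:   ", np.linalg.norm(W @ guided - target))
    print("feature error without guidance:", np.linalg.norm(W @ unguided - target))
```

In this toy setting, a larger `kappa` pulls the sample toward the target features more strongly at each guided step, while `eta` controls how many steps are affected at all; setting `eta` to 0 recovers unguided sampling.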

Abstract

Reconstructing visual stimuli from functional Magnetic Resonance Imaging (fMRI) enables fine-grained retrieval of brain activity. However, the accurate reconstruction of diverse details, including structure, background, texture, and color, remains challenging. Stable diffusion models inevitably produce variable reconstructions, even under identical conditions. To address this challenge, we first examine diffusion methods from a neuroscientific perspective: they primarily involve top-down creation using pre-trained knowledge from extensive image datasets, but tend to lack detail-driven bottom-up perception, leading to a loss of faithful details. In this paper, we propose NeuralDiffuser, which incorporates primary visual feature guidance to provide detailed cues in the form of gradients. This extension of the bottom-up process for diffusion models achieves both semantic coherence and detail fidelity when reconstructing visual stimuli. Furthermore, we have developed a novel guidance strategy for reconstruction tasks so that repeated outputs stay consistent with the original images rather than diverging from one another. Extensive experimental results on the Natural Scenes Dataset (NSD) qualitatively and quantitatively demonstrate the advancement of NeuralDiffuser through comparisons against baseline and state-of-the-art methods, as well as through ablation studies.

Paper Structure (25 sections, 14 equations, 11 figures, 4 tables, 1 algorithm)

Figures (11)

  • Figure 1: Schematic diagram of bottom-up and top-down processes in neuroscience. The perception of the visual scene is shaped by the reciprocal interaction of bottom-up perception, driven by visual cues from the retina (blue flows), and top-down creation, which incorporates prior knowledge and experience (red flows).
  • Figure 2: Reconstructed examples from a diffusion-based method (left) and NeuralDiffuser (right). Diffusion-based results are initialized from blurry initial images but lack detail cues, and therefore tend to produce 5 different images. In contrast, NeuralDiffuser applies primary visual feature guidance, thereby obtaining faithful details and 5 images consistent with the original.
  • Figure 3: Reconstructed images using VAEs and GANs
  • Figure 4: Overview of NeuralDiffuser. During training, fMRI voxels are first mapped to a subject-shared space and then aligned with ground-truth embeddings, namely the CLIP text space, the VAE latent space, and multiple layers of the CLIP image encoder. During inference, the model runs a forward and reverse diffusion process, with or without guidance, to generate reconstructed images.
  • Figure 5: Decoding performance of fMRI embeddings.
  • ...and 6 more figures