NeuroSwift: A Lightweight Cross-Subject Framework for fMRI Visual Reconstruction of Complex Scenes

Shiyi Zhang, Dong Liang, Yihang Zhou

TL;DR

NeuroSwift tackles cross-subject fMRI-to-image reconstruction by coupling a structural latent pathway (AutoKL Adapter) with a semantic reinforcement pathway (CLIP Adapter) within a diffusion-based framework. After pretraining on a single subject, it fine-tunes only 17% of parameters for each new subject, which takes about 1 hour of per-subject training on three RTX 4090 GPUs, and it uses individualized ROI masks to reduce registration errors. The method achieves state-of-the-art spatial fidelity and semantic accuracy on complex scenes, outperforming existing cross-subject approaches while remaining computationally efficient. Its interpretable adapter weights also point to distinct neural substrates for low-level structure versus high-level semantics, and its low training cost moves brain decoding closer to real-time, resource-efficient use.

Abstract

Reconstructing visual information from brain activity via computer vision technology provides an intuitive understanding of visual neural mechanisms. Despite progress in decoding fMRI data with generative models, achieving accurate cross-subject reconstruction of visual stimuli remains challenging and computationally demanding. This difficulty arises from inter-subject variability in neural representations and the brain's abstract encoding of core semantic features in complex visual inputs. To address these challenges, we propose NeuroSwift, which integrates complementary adapters via diffusion: AutoKL for low-level features and CLIP for semantics. NeuroSwift's CLIP Adapter is trained on Stable Diffusion generated images paired with COCO captions to emulate higher visual cortex encoding. For cross-subject generalization, we pretrain on one subject and then fine-tune only 17 percent of parameters (fully connected layers) for new subjects, while freezing other components. This enables state-of-the-art performance with only one hour of training per subject on lightweight GPUs (three RTX 4090), and it outperforms existing methods.

Paper Structure

This paper contains 16 sections, 8 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: (a) During scanning, subjects imagine semantic content rather than raw visual stimuli. Therefore, we leverage COCO Captions to emulate imagined semantics and Semantic Images to emulate imagined scenes. (b) Efficient cross‑subject adaptation. Pretrain on the first subject, then fine‑tune only 17% of parameters on the other subjects in one hour with 3×RTX4090 GPUs.
  • Figure 2: Overall structure of NeuroSwift in single‑subject mode. fMRI voxels are processed through hierarchical pipelines: (a) The structural generation pipeline transforms voxels into latent space representations ($z_{\textit{pred}}$), which serve as the diffusion prior for noise addition ($z_{\tau}$). (b) The semantic reinforcement pipeline projects voxels into CLIP's text ($e_{\textit{txt\_pred}}$) and image ($e_{\textit{img\_pred}}$) embeddings. Then, the denoising UNet iteratively refines the noise representation $z_{\tau}$ to integrate the CLIP semantic embeddings. Finally, the frozen AutoKL Decoder decodes $z_{\textit{final}}$ into the reconstructed image.
  • Figure 3: Examples of NeuroSwift reconstructions from complex visual stimuli.
  • Figure 4: Comparison of our framework, MindEye2, and MindTuner using 1h of training data in cross‑subject adaptation mode.
  • Figure 5: Comparison of our framework with MindBridge using 40h of training data in single‑subject mode.
  • ...and 2 more figures
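
The Figure 2 caption names a noise-addition step that turns $z_{\textit{pred}}$ into $z_{\tau}$ but does not give its form. If that step follows the standard forward process of diffusion models (an assumption, not something stated in the caption), it would read as below. The symbols $\bar{\alpha}_{\tau}$ (cumulative noise schedule at step $\tau$) and $\epsilon$ (Gaussian noise) are introduced here for illustration only.

$$
z_{\tau} = \sqrt{\bar{\alpha}_{\tau}}\, z_{\textit{pred}} + \sqrt{1 - \bar{\alpha}_{\tau}}\, \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)
$$

Under this reading, a smaller $\bar{\alpha}_{\tau}$ discards more of the structural prior in $z_{\textit{pred}}$ and leaves more room for the CLIP embeddings to shape the result during denoising.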