Decoding Visual Experience and Mapping Semantics through Whole-Brain Analysis Using fMRI Foundation Models

Yanchen Wang, Adam Turnbull, Tiange Xiang, Yunlong Xu, Sa Zhou, Adnan Masoud, Shekoofeh Azizi, Feng Vankee Lin, Ehsan Adeli

TL;DR

This work expands neural decoding from a predominantly visual-cortex focus to whole-brain mapping by introducing WAVE, a framework that leverages an fMRI foundation model and a diffusion-based generator trained with multi-modal contrastive learning. By decoding visual experiences across the entire cortex, WAVE achieves superior semantic reconstruction and reveals that high-level networks, especially the Default Mode Network and the Dorsal Attention Network, play crucial roles beyond early visual areas. The approach also demonstrates zero-shot generalization to imagined scenarios, and a post-hoc semantic analysis links visual clusters to distributed brain networks, offering interpretable insights into brain–behavior relationships. Taken together, the results underscore the potential of brain foundation models to democratize complex brain-behavior analyses in smaller datasets while highlighting the distributed nature of visual cognition and semantic processing.

Abstract

Neural decoding, the process of understanding how brain activity corresponds to different stimuli, has been a primary objective in cognitive sciences. Over the past three decades, advances in functional Magnetic Resonance Imaging (fMRI) and machine learning have greatly improved our ability to map visual stimuli to brain activity, especially in the visual cortex. Concurrently, research has expanded to decode more complex processes, such as language and memory across the whole brain, using techniques to handle greater variability and improve signal accuracy. We argue that "seeing" involves more than just mapping visual stimuli onto the visual cortex; it engages the entire brain, as various emotions and cognitive states can emerge from observing different scenes. In this paper, we develop algorithms to enhance our understanding of visual processes by incorporating whole-brain activation maps while individuals are exposed to visual stimuli. We utilize transformer-based large-scale fMRI encoders and image generative models (encoders and decoders) pre-trained on large public datasets, which are then fine-tuned through image-fMRI contrastive learning. Our models can decode visual experience across the entire cerebral cortex, surpassing the traditional confines of the visual cortex. Using a public dataset (BOLD5000), we first compare our method with state-of-the-art approaches for decoding visual processing and show a 43% improvement in predictive semantic accuracy. A network ablation analysis suggests that, beyond the visual cortex, the default mode network contributes significantly to stimulus decoding, in line with the proposed role of this network in sense-making and semantic processing.
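
To make the image-fMRI contrastive fine-tuning step more concrete, the following is a minimal sketch of a symmetric, CLIP-style InfoNCE loss between paired fMRI and image embeddings. It is an illustration under stated assumptions, not the authors' implementation: the function names, the temperature value of 0.07, and the use of plain Python lists in place of tensors are all hypothetical choices made here.

```python
import math

def l2_normalize(vec):
    """Scale a non-zero vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def diagonal_cross_entropy(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def symmetric_info_nce(fmri_emb, image_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    fmri_emb[i] and image_emb[i] are assumed to come from the same trial.
    The temperature is an assumed hyperparameter, not a value from the paper.
    """
    fmri_emb = [l2_normalize(v) for v in fmri_emb]
    image_emb = [l2_normalize(v) for v in image_emb]
    # Pairwise cosine similarities, scaled by the temperature.
    logits = [
        [sum(a * b for a, b in zip(f, im)) / temperature for im in image_emb]
        for f in fmri_emb
    ]
    logits_t = [list(col) for col in zip(*logits)]  # image-to-fMRI direction
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits_t))
```

With this loss, a batch whose fMRI and image embeddings are correctly paired scores lower than the same batch with the pairing shuffled, which is the signal that pulls matching fMRI and image representations together during fine-tuning.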

Paper Structure

This paper contains 39 sections, 3 equations, 7 figures, 10 tables.

Figures (7)

  • Figure 1: Model Pipeline and performance comparison with previous work. a, Study Formation: The diagram illustrates our model, WAVE, which reconstructs visual stimuli from fMRI data. b, Model Training Framework: After preprocessing the raw fMRI data, WAVE integrates three modalities (fMRI, image, and text) to perform contrastive learning. The features are then passed to a diffusion model for final image reconstruction. c, Brain Saliency Maps: The saliency maps compare the input fMRI data between Mind-Vis and WAVE. Mind-Vis utilizes only visual cortex data (highlighted voxels in red), whereas WAVE employs whole-brain fMRI data. The saliency map for WAVE is based on the model's attention, demonstrating a broader and more comprehensive engagement of brain regions. d, Quantitative Analysis: This graph compares the performance of the universal model in reconstructing images between Mind-Vis and WAVE. The y-axis represents CLIP accuracy [radford2021learning], a high-level metric, and the x-axis represents AlexNet(5) accuracy [krizhevsky2012imagenet], a low-level metric [ozcelik2023natural]. The distributions are displayed on the sides, along with sample images, showing the superior performance of WAVE in capturing both high-level and low-level features.
  • Figure 2: Comparative Evaluation and Network Ablation of the WAVE Framework. a, Qualitative comparison of reconstructed images. Columns display the original visual stimulus alongside reconstructions from WAVE and baselines. b, Universal settings across all four subjects in the BOLD5000 dataset. This panel demonstrates the generalization capability of WAVE, MindEye, and Mind-Vis among different subjects. c, The saliency map of the WAVE model using data where the visual cortex has been masked, showing the top 20 regions of interest. d, Decoding from the visual cortex-masked model: examples of reconstructed images using non-visual fMRI regions. In each pair, the left image shows the original visual stimulus, and the right image shows the reconstructed image generated by our visual cortex-masked method (WAVE). e, Impact of network ablation on decoding accuracy (subject CSI-1). The left side displays generated images resulting from the masking of each of the seven networks. The right side features a box plot illustrating the decoding accuracy for each network ablation, highlighting the accuracy reduction when specific networks are removed. The red dashed line above represents the whole-brain decoding accuracy for comparison.
  • Figure 3: Semantic Profiling of Whole-Brain Visual Representations. a, t-SNE projection of image embeddings, colored by five distinct semantic clusters identified via K-Means. b-f, Detailed profiles for each cluster. Word clouds illustrate the most frequent object labels within each group. The subfigure titles, generated by entering the words into ChatGPT-4, summarize the thematic essence of each cluster. Accompanying each word cloud are selected image samples and a whole-brain saliency map highlighting the top 20 regions of interest relevant to the cluster. g, Network-level decomposition of the saliency maps. The bar chart quantifies the distribution of top regions across the Yeo-7 networks, providing insights into the network-based localization of visual processing associated with different categories.
  • Figure 4: Generalization to Zero-Shot Mental Imagery. a, WAVE performance on an independent dataset ($N=24$), where participants imagined scenarios based on verbal prompts. This analysis measures the cosine distance of each scenario's features to those in the training dataset (BOLD5000). The scatter plot shows the correlation of these cosine distances with decoding accuracy: scenarios that are more similar to the training dataset had more accurate predictions. b, Example of zero-shot images reconstructed by WAVE from fMRI sessions recorded during imagination. The text stimuli describing the museum scenario are presented above the example image, illustrating the model's capability to generate visual reconstructions based on described scenarios.
  • Figure 5: The depicted architecture illustrates the fMRI data processing and a two-part training approach for the model. a, fMRI preprocessing involving network parcellations and BOLD signal segmentation. b, Focus on contrastive learning where knowledge is distilled across three modalities: fMRI, text, and images. c, Training of the diffusion model, which involves fine-tuning a specialized prior to convert fMRI latent representations into image latent variables. Icons of fire and snowflake denote modules that are active (trainable) and inactive (frozen) during the training phase, respectively.
  • ...and 2 more figures
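
The network ablation summarized in Figure 2e can be sketched roughly as follows: mask every parcel belonging to one network, rerun decoding, and record the accuracy drop relative to the whole-brain baseline. This is a minimal sketch under assumptions, not the authors' procedure. Masking by zeroing the signal, the parcel-to-network label list, and the `decode_accuracy` callable (a stand-in for a trained decoder plus its evaluation metric) are all hypothetical.

```python
def ablate_network(parcel_signals, parcel_to_network, network):
    """Return a copy of the data with every parcel in `network` zeroed out.

    parcel_signals: list of per-parcel time series (lists of floats).
    parcel_to_network: network label for each parcel, same order.
    Zeroing is an assumed masking strategy; the input is left unmodified.
    """
    return [
        [0.0] * len(series) if parcel_to_network[i] == network else list(series)
        for i, series in enumerate(parcel_signals)
    ]

def ablation_drops(parcel_signals, parcel_to_network, decode_accuracy):
    """Compute the accuracy drop caused by masking each network in turn.

    decode_accuracy: hypothetical callable mapping parcel signals to a
    scalar accuracy; in practice it would wrap the trained model.
    """
    baseline = decode_accuracy(parcel_signals)
    drops = {}
    for network in sorted(set(parcel_to_network)):
        masked = ablate_network(parcel_signals, parcel_to_network, network)
        drops[network] = baseline - decode_accuracy(masked)
    return baseline, drops
```

A larger drop for a given network indicates that decoding depends more heavily on it, which is the kind of comparison the box plot in Figure 2e reports against the whole-brain accuracy line.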