See Through Their Minds: Learning Transferable Neural Representation from Cross-Subject fMRI

Yulong Liu, Yongqiang Ma, Guibo Zhu, Haodong Jing, Nanning Zheng

TL;DR

This work addresses the challenge of limited and noisy fMRI data for vision decoding by introducing STTM, a cross-subject framework that learns transferable neural representations through shallow subject-specific adapters and a shared, deeper decoding core. It couples a high-level perception pipeline, aligned with CLIP visual tokens and textual descriptions, with a low-level pixel-wise reconstruction pipeline guided by high-level semantic signals, and employs diffusion prior models for image generation. The approach combines multi-modal supervision (visual and textual), two complementary contrastive losses (GVLC and FVC), and an efficient diffusion prior to achieve robust cross-subject decoding. It also transfers to new subjects (on GOD) by training only new adapters on limited data. Empirical results on NSD and GOD show competitive or superior performance across retrieval, reconstruction, and zero-shot classification tasks, and indicate that combining high-level semantic guidance with low-level pixel information improves both low-level and high-level reconstruction metrics. The work offers a practical route to a transferable fMRI model through cross-subject training and modular adapters, and a scalable path toward cross-brain decoding and brain-inspired AI, while noting the memory cost of per-subject adapters and the need for careful ethical consideration in deployment.

Abstract

Deciphering visual content from functional Magnetic Resonance Imaging (fMRI) helps illuminate the human vision system. However, the scarcity of fMRI data and noise hamper brain decoding model performance. Previous approaches primarily employ subject-specific models, sensitive to training sample size. In this paper, we explore a straightforward but overlooked solution to address data scarcity. We propose shallow subject-specific adapters to map cross-subject fMRI data into unified representations. Subsequently, a shared deeper decoding model decodes cross-subject features into the target feature space. During training, we leverage both visual and textual supervision for multi-modal brain decoding. Our model integrates a high-level perception decoding pipeline and a pixel-wise reconstruction pipeline guided by high-level perceptions, simulating bottom-up and top-down processes in neuroscience. Empirical experiments demonstrate robust neural representation learning across subjects for both pipelines. Moreover, merging high-level and low-level information improves both low-level and high-level reconstruction metrics. Additionally, we successfully transfer learned general knowledge to new subjects by training new adapters with limited training data. Compared to previous state-of-the-art methods, notably pre-training-based methods (Mind-Vis and fMRI-PTE), our approach achieves comparable or superior results across diverse tasks, showing promise as an alternative method for cross-subject fMRI data pre-training. Our code and pre-trained weights will be publicly released at https://github.com/YulongBonjour/See_Through_Their_Minds.
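
The adapter-plus-shared-decoder idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the class names, layer sizes, voxel counts, and random initialisation are all hypothetical, and training is omitted. It only shows how per-subject inputs of different sizes might be mapped into one shared space and decoded by a common core.

```python
import numpy as np

rng = np.random.default_rng(0)

SHARED_DIM = 64  # hypothetical width of the unified representation
TARGET_DIM = 32  # hypothetical width of the target feature space


class SubjectAdapter:
    """Shallow per-subject linear map: voxels -> shared space (illustrative)."""

    def __init__(self, n_voxels: int):
        self.W = rng.normal(0.0, n_voxels ** -0.5, size=(n_voxels, SHARED_DIM))
        self.b = np.zeros(SHARED_DIM)

    def __call__(self, x: np.ndarray) -> np.ndarray:
        return x @ self.W + self.b


class SharedDecoder:
    """Deeper core shared by all subjects: shared space -> target space."""

    def __init__(self, hidden: int = 128):
        self.W1 = rng.normal(0.0, SHARED_DIM ** -0.5, size=(SHARED_DIM, hidden))
        self.W2 = rng.normal(0.0, hidden ** -0.5, size=(hidden, TARGET_DIM))

    def __call__(self, z: np.ndarray) -> np.ndarray:
        # Two-layer MLP with a ReLU non-linearity.
        return np.maximum(z @ self.W1, 0.0) @ self.W2


# Hypothetical voxel counts; each subject has a different input size.
voxel_counts = {"subj_a": 1500, "subj_b": 1320}
adapters = {name: SubjectAdapter(n) for name, n in voxel_counts.items()}
decoder = SharedDecoder()


def decode(subject: str, fmri: np.ndarray) -> np.ndarray:
    """Route a batch of fMRI patterns through the subject's adapter, then the shared core."""
    return decoder(adapters[subject](fmri))


# Transfer to a new subject: add a new adapter and reuse the shared decoder unchanged.
adapters["subj_new"] = SubjectAdapter(900)
out = decode("subj_new", rng.normal(size=(4, 900)))
print(out.shape)
```

In this sketch the only subject-specific parameters live in the adapters, which is why adding a subject requires one new adapter and no change to the shared decoder.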

Paper Structure (45 sections, 6 equations, 8 figures, 10 tables)

Figures (8)

  • Figure 1: Overview of our STTM framework, which consists of a high-level perception decoding pipeline and a pixel-wise reconstruction pipeline (the low-level pipeline). The two pipelines are trained sequentially, and the pixel-wise reconstruction pipeline is guided by the high-level pipeline. The final reconstructions are generated in an img2img setting [meng2021sdedit] using the Versatile Diffusion model [xu2023versatile]. Subject adapters transform cross-subject fMRI data into a unified feature space for each of the two pipelines. For new subjects, transfer learning can be conducted by training new adapters. The whole framework is inspired by the bottom-up and top-down processes in neuroscience.
  • Figure 2: Reconstruction examples for subject 1 in the NSD dataset. The "Low reconstructions" are from the low-level pipeline, and the "final reconstructions" are obtained in the img2img setting [meng2021sdedit].
  • Figure 3: Reconstruction examples for subject 3 in the GOD dataset.
  • Figure A1: Using the imagery experiment data provided in the Generic Object Decoding (GOD) dataset [horikawa2017generic], we attempt to visualize mental images from brain activity in the visual cortex. This activity was recorded while subjects, with their eyes closed, freely imagined an object prompted by a text cue, so no ground-truth images exist for comparison. Importantly, the object categories used in this experiment do not overlap with the GOD training set. To generate these visualizations, we averaged the fMRI patterns across 10 trials for each category. We present each category name alongside a reference image (on the left) and a generated image (on the right) to illustrate the decoded mental imagery. These results, obtained with subject 3, use exactly the same STTM-H model as detailed in the main body of this paper.
  • Figure A2: Qualitative comparison with state-of-the-art methods on the GOD dataset. Visually, our method produces similar or even better results than CMVDM [zeng2023controllable] and Mind-Vis [chen2023seeing].
  • ...and 3 more figures
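
The trial averaging mentioned in the Figure A1 caption (fMRI patterns averaged across 10 trials per category) could look roughly like the following. This is a sketch under assumptions: the function name `average_trials`, the array layout (trials by voxels), and the toy data are hypothetical and not taken from the paper's code.

```python
import numpy as np


def average_trials(patterns: np.ndarray, labels) -> dict:
    """Average fMRI patterns over all trials that share a category label.

    patterns: array of shape (n_trials, n_voxels)
    labels:   sequence of n_trials category names
    Returns a dict mapping each category name to its mean pattern (n_voxels,).
    """
    labels = np.asarray(labels)
    return {
        str(category): patterns[labels == category].mean(axis=0)
        for category in np.unique(labels)
    }


# Toy data: 2 hypothetical categories, 10 trials each, 5 voxels per pattern.
rng = np.random.default_rng(0)
toy_patterns = rng.normal(size=(20, 5))
toy_labels = ["category_a"] * 10 + ["category_b"] * 10

averaged = average_trials(toy_patterns, toy_labels)
for name, mean_pattern in averaged.items():
    print(name, mean_pattern.shape)
```

Each averaged pattern would then be passed to the decoding model in place of a single-trial pattern.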