See Through Their Minds: Learning Transferable Neural Representation from Cross-Subject fMRI
Yulong Liu, Yongqiang Ma, Guibo Zhu, Haodong Jing, Nanning Zheng
TL;DR
This work addresses the challenge of limited and noisy fMRI data for vision decoding by introducing STTM, a cross-subject framework that learns transferable neural representations through shallow subject-specific adapters and a shared deep decoding core. It couples a high-level perception pipeline, aligned with CLIP visual tokens and textual descriptions, with a low-level pixel-wise reconstruction pathway guided by high-level semantic signals, and uses a diffusion prior for image generation. The approach combines multi-modal supervision (visual and textual), two complementary contrastive losses (GVLC and FVC), and an efficient diffusion prior to achieve robust cross-subject decoding and transfer to new subjects (GOD), including under zero-shot settings. Empirical results on NSD and GOD show competitive or superior performance across retrieval, reconstruction, and zero-shot classification tasks, and indicate that high-level semantic guidance and low-level pixel fidelity benefit each other. Through cross-subject training and modular adapters, the work offers a practical, transferable fMRI foundation model and a scalable path toward cross-brain decoding and brain-inspired AI. It also notes the memory cost of per-subject adapters and the need for careful ethical consideration in deployment.
Abstract
Deciphering visual content from functional Magnetic Resonance Imaging (fMRI) helps illuminate the human visual system. However, the scarcity and noise of fMRI data hamper the performance of brain decoding models. Previous approaches primarily employ subject-specific models, which are sensitive to training sample size. In this paper, we explore a straightforward but overlooked solution to data scarcity. We propose shallow subject-specific adapters that map cross-subject fMRI data into unified representations. A shared, deeper decoding model then decodes these cross-subject features into the target feature space. During training, we leverage both visual and textual supervision for multi-modal brain decoding. Our model integrates a high-level perception decoding pipeline and a pixel-wise reconstruction pipeline guided by high-level perceptions, simulating the bottom-up and top-down processes described in neuroscience. Experiments demonstrate robust neural representation learning across subjects for both pipelines. Moreover, merging high-level and low-level information improves both low-level and high-level reconstruction metrics. We also successfully transfer the learned general knowledge to new subjects by training new adapters with limited training data. Compared to previous state-of-the-art methods, notably pre-training-based methods (Mind-Vis and fMRI-PTE), our approach achieves comparable or superior results across diverse tasks, showing promise as an alternative method for cross-subject fMRI data pre-training. Our code and pre-trained weights will be publicly released at https://github.com/YulongBonjour/See_Through_Their_Minds.
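To make the adapter-plus-shared-core idea concrete, the following is a minimal NumPy sketch of the data flow only, not the authors' implementation. The class names, layer sizes, voxel counts, and subject identifiers are all hypothetical, the weights are random, and no training is performed. It illustrates that subjects with different voxel counts are mapped into one unified space, and that supporting a new subject only requires adding one new shallow adapter in front of the same shared decoder.

```python
import numpy as np

rng = np.random.default_rng(0)

LATENT_DIM = 64  # assumed size of the unified representation
TARGET_DIM = 32  # assumed size of the target feature space


class SubjectAdapter:
    """Shallow, subject-specific linear map: voxels -> unified representation."""

    def __init__(self, n_voxels, latent_dim=LATENT_DIM):
        # Random weights stand in for learned parameters (illustration only).
        self.W = rng.normal(0.0, n_voxels ** -0.5, size=(n_voxels, latent_dim))
        self.b = np.zeros(latent_dim)

    def __call__(self, x):
        return x @ self.W + self.b


class SharedCore:
    """Deeper decoder shared by all subjects: unified -> target features."""

    def __init__(self, latent_dim=LATENT_DIM, hidden_dim=128, target_dim=TARGET_DIM):
        self.W1 = rng.normal(0.0, latent_dim ** -0.5, size=(latent_dim, hidden_dim))
        self.W2 = rng.normal(0.0, hidden_dim ** -0.5, size=(hidden_dim, target_dim))

    def __call__(self, z):
        h = np.maximum(z @ self.W1, 0.0)  # ReLU nonlinearity
        return h @ self.W2


def decode(fmri, subject, adapters, core):
    """Route a batch of fMRI vectors through its subject's adapter, then the shared core."""
    return core(adapters[subject](fmri))


# Hypothetical subjects with different numbers of voxels.
voxel_counts = {"subj01": 500, "subj02": 420}
adapters = {name: SubjectAdapter(n) for name, n in voxel_counts.items()}
core = SharedCore()

batch = rng.normal(size=(4, voxel_counts["subj02"]))
features = decode(batch, "subj02", adapters, core)

# A new subject reuses the shared core; only a new adapter is added.
adapters["new_subject"] = SubjectAdapter(300)
new_features = decode(rng.normal(size=(2, 300)), "new_subject", adapters, core)

print(features.shape, new_features.shape)
```

In an actual training setup, the adapters and the core would be optimized jointly on the source subjects, and for a new subject only the new adapter's parameters would need to be fitted, which is what makes transfer with limited data plausible. The per-subject weight matrices also make the memory cost of adding subjects visible: each adapter holds n_voxels × LATENT_DIM parameters.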
