CLIP-MUSED: CLIP-Guided Multi-Subject Visual Neural Information Semantic Decoding
Qiongyi Zhou, Changde Du, Shengpei Wang, Huiguang He
TL;DR
This work tackles the challenge of generalizing visual neural decoding across multiple subjects when data per subject are limited. It introduces CLIP-MUSED, a Transformer-based fMRI feature extractor with learnable subject-specific low-level and high-level tokens. Representational similarity analysis (RSA) guides the token representations toward the topological structure of CLIP stimulus representations. By aligning multi-subject neural responses in a shared space and using both low-level and high-level semantic guidance, CLIP-MUSED achieves state-of-the-art performance among multi-subject methods on the HCP and NSD datasets, while its parameter count does not scale linearly with the number of subjects. The approach offers insights into inter-subject neural variability, supports scalable aggregation of large datasets, and points toward foundation-model-style principles for neuroimaging analysis.
Abstract
Decoding visual neural information is hard to generalize from single-subject models to multiple subjects because of individual differences. Moreover, the limited data available from any single subject constrains model performance. Although prior multi-subject decoding methods have made significant progress, they still suffer from several limitations, including difficulty in extracting global neural response features, linear scaling of model parameters with the number of subjects, and inadequate characterization of the relationship between the neural responses of different subjects to various stimuli. To overcome these limitations, we propose a CLIP-guided Multi-sUbject visual neural information SEmantic Decoding (CLIP-MUSED) method. Our method uses a Transformer-based feature extractor to model global neural representations effectively. It also incorporates learnable subject-specific tokens that facilitate the aggregation of multi-subject data without a linear increase in parameters. Additionally, we employ representational similarity analysis (RSA) to guide token representation learning based on the topological relationship of visual stimuli in the representation space of CLIP, enabling full characterization of the relationship between the neural responses of different subjects under different stimuli. Finally, the token representations are used for multi-subject semantic decoding. Our proposed method outperforms single-subject decoding methods and achieves state-of-the-art performance among existing multi-subject methods on two fMRI datasets. Visualization results provide insights into the effectiveness of our proposed method. Code is available at https://github.com/CLIP-MUSED/CLIP-MUSED.
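As a rough illustration of the RSA guidance described above, the sketch below compares the pairwise similarity structure of token representations with that of CLIP stimulus features for the same batch of stimuli. It is not the released implementation: the function names, the use of cosine similarity, and the mean-squared-error comparison are assumptions made for illustration, and the abstract does not specify the exact loss. The code uses only the Python standard library so that it stays self-contained; a practical version would use a tensor library.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity_matrix(features: Sequence[Vector]) -> List[List[float]]:
    """Pairwise similarity matrix: entry (i, j) compares samples i and j."""
    return [[cosine_similarity(a, b) for b in features] for a in features]


def rsa_loss(token_features: Sequence[Vector],
             clip_features: Sequence[Vector]) -> float:
    """Hypothetical RSA-style loss (an assumption, not the paper's formula).

    Row i of each input describes the same stimulus: one row holds the
    token representation extracted from a subject's fMRI response, the
    other the CLIP feature of that stimulus. Rows may come from different
    subjects, so the loss also relates responses across subjects.
    The two feature spaces may have different dimensionality, because
    only their similarity matrices are compared.
    """
    n = len(token_features)
    if n != len(clip_features):
        raise ValueError("both inputs must describe the same stimuli")
    if n < 2:
        raise ValueError("at least two stimuli are needed")

    token_sim = similarity_matrix(token_features)
    clip_sim = similarity_matrix(clip_features)

    # Average squared difference over the upper triangle (i < j);
    # the diagonal is skipped because self-similarity carries no information.
    total = 0.0
    count = 0
    for i in range(n):
        for j in range(i + 1, n):
            diff = token_sim[i][j] - clip_sim[i][j]
            total += diff * diff
            count += 1
    return total / count


if __name__ == "__main__":
    # Toy CLIP features for three stimuli (2-D for readability).
    clip = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    # Toy token features (3-D) with the same pairwise geometry as `clip`.
    aligned_tokens = [[2.0, 0.0, 0.0], [0.0, 3.0, 0.0], [1.0, 1.0, 0.0]]
    # Toy token features whose geometry disagrees with `clip`.
    misaligned_tokens = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

    print("aligned loss:", rsa_loss(aligned_tokens, clip))
    print("misaligned loss:", rsa_loss(misaligned_tokens, clip))
```

In this toy example the aligned tokens give a loss near zero and the misaligned tokens give a clearly larger one. Minimizing such a term during training would push token representations from all subjects toward the stimulus geometry of the CLIP space.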
