CLIP-MUSED: CLIP-Guided Multi-Subject Visual Neural Information Semantic Decoding

Qiongyi Zhou; Changde Du; Shengpei Wang; Huiguang He

TL;DR

This work tackles the challenge of generalizing visual neural decoding across multiple subjects when only limited data is available per subject. It introduces CLIP-MUSED, a Transformer-based fMRI feature extractor with learnable subject-specific low-level and high-level tokens. Token learning is guided, via representational similarity analysis (RSA), by the topological structure of CLIP stimulus representations. By aligning multi-subject neural responses to a shared space and employing both low-level and high-level semantic guidance, CLIP-MUSED achieves state-of-the-art performance among multi-subject methods on the HCP and NSD datasets, while keeping the parameter count from scaling linearly with the number of subjects. The approach offers insights into inter-subject neural variability, supports scalable aggregation of large datasets, and points toward foundation-model-style principles for neuroimaging analysis.

Abstract

The study of decoding visual neural information faces challenges in generalizing single-subject decoding models to multiple subjects, due to individual differences. Moreover, the limited availability of data from a single subject constrains model performance. Although prior multi-subject decoding methods have made significant progress, they still suffer from several limitations, including difficulty in extracting global neural response features, linear scaling of model parameters with the number of subjects, and inadequate characterization of the relationship between neural responses of different subjects to various stimuli. To overcome these limitations, we propose a CLIP-guided Multi-sUbject visual neural information SEmantic Decoding (CLIP-MUSED) method. Our method consists of a Transformer-based feature extractor to effectively model global neural representations. It also incorporates learnable subject-specific tokens that facilitate the aggregation of multi-subject data without a linear increase in parameters. Additionally, we employ representational similarity analysis (RSA) to guide token representation learning based on the topological relationship of visual stimuli in the representation space of CLIP, enabling full characterization of the relationship between neural responses of different subjects under different stimuli. Finally, token representations are used for multi-subject semantic decoding. Our proposed method outperforms single-subject decoding methods and achieves state-of-the-art performance among the existing multi-subject methods on two fMRI datasets. Visualization results provide insights into the effectiveness of our proposed method. Code is available at https://github.com/CLIP-MUSED/CLIP-MUSED.
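
The RSA guidance described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the cosine similarity measure, the mean-squared-error objective, and the names `rsm` and `rsa_loss` are assumptions chosen for illustration, and random arrays stand in for real CLIP features and fMRI token representations. The idea shown is only that the pairwise similarity structure of the neural token representations is pushed toward the pairwise similarity structure of the same stimuli in CLIP space.

```python
import numpy as np

def rsm(features):
    """Representational similarity matrix: pairwise cosine similarity
    between the rows of an (n_samples, dim) array. (Cosine similarity is
    an assumption; other similarity measures are possible.)"""
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.clip(norms, 1e-12, None)  # guard against zero rows
    return unit @ unit.T

def rsa_loss(token_repr, clip_feats):
    """Hypothetical RSA objective: mean squared difference between the RSM
    of the neural token representations and the RSM of the CLIP features
    for the same stimuli. The feature dimensions may differ, because only
    the (n_samples, n_samples) similarity matrices are compared."""
    diff = rsm(token_repr) - rsm(clip_feats)
    return float(np.mean(diff ** 2))

# Stand-in data: 8 stimuli, 16-d "CLIP" features, 32-d "token" representations.
rng = np.random.default_rng(0)
clip_feats = rng.normal(size=(8, 16))
tokens = rng.normal(size=(8, 32))

print("RSA loss (random tokens):", rsa_loss(tokens, clip_feats))
print("RSA loss (identical structure):", rsa_loss(clip_feats, clip_feats))
```

Because the loss compares similarity matrices rather than raw features, rows from different subjects can share one batch. This is one way to read the abstract's claim that the relationship between responses of different subjects under different stimuli is characterized.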

Paper Structure (24 sections, 7 equations, 9 figures, 6 tables)

Introduction
Methodology
Overview
CLIP-based feature extraction of visual stimuli
Transformer-based fMRI feature extraction
Multi-subject shared neural response representation
Semantic classifier
Optimization objective
Experiments
Datasets
Baseline methods
Parameter settings
Evaluation metrics
Comparative experimental results
Ablation study
...and 9 more sections

Figures (9)

Figure 1: Diagram of the different multi-subject functional alignment methods.
Figure 2: The framework of the proposed method. Left: Low-level and high-level feature RSMs of the visual stimuli are first obtained from CLIP. Right: The Transformer-based encoder extracts multi-subject shared neural representations guided by the visual stimulus feature RSMs.
Figure 3: Transformer-based fMRI feature extractor of CLIP-MUSED. (a) Conversion of BOLD signals in Volume format to BOLD patches. (b) Flowchart of the feature extraction process. (c) Network structure of the Transformer encoder.
Figure 4: On the HCP dataset, attention maps of (a) low-level tokens and (b) high-level tokens of our method, and (c) attention maps of subject embeddings of the MS-EMB method, were visualized on the cortical surface for 4 randomly selected subjects.
Figure C5: Performance comparison between SS-ViT and our method on each subject of the HCP dataset.
...and 4 more figures
