SIM: Surface-based fMRI Analysis for Inter-Subject Multimodal Decoding from Movie-Watching Experiments

Simon Dahan, Gabriel Bénédict, Logan Z. J. Williams, Yourong Guo, Daniel Rueckert, Robert Leech, Emma C. Robinson

TL;DR

SIM presents a surface-based approach to inter-subject multimodal decoding by projecting 7T fMRI onto cortical surfaces and encoding with a surface vision transformer (SiT). It couples vsMAE pretraining with tri-modal CLIP alignment across fMRI, video, and audio to enable retrieval and reconstruction of unseen movie clips from brain activity, generalizing to new subjects and scenes. The framework yields interpretable attention maps aligned with known brain networks and demonstrates substantial gains over baselines in cross-subject and cross-scene generalization, signaling potential for personalized brain simulations. This work advances scalable, cross-subject brain decoding and paves the way for digital twins of brain function in response to novel stimuli and tasks.

Abstract

Current AI frameworks for brain decoding and encoding typically train and test models within the same datasets. This limits their utility for brain-computer interfaces (BCI) or neurofeedback, for which it would be useful to pool experiences across individuals to better simulate stimuli not sampled during training. A key obstacle to model generalisation is the degree of inter-subject variability in cortical organisation, which makes it difficult to align or compare cortical signals across participants. In this paper we address this through the use of surface vision transformers, which build a generalisable model of cortical functional dynamics by encoding the topography of cortical networks and their interactions as a moving image across a surface. This is then combined with tri-modal self-supervised contrastive (CLIP) alignment of audio, video, and fMRI modalities to enable the retrieval of visual and auditory stimuli from patterns of cortical activity (and vice versa). We validate our approach on 7T task-fMRI data from 174 healthy participants engaged in the movie-watching experiment from the Human Connectome Project (HCP). Results show that it is possible to detect which movie clips an individual is watching purely from their brain activity, even for individuals and movies not seen during training. Further analysis of attention maps reveals that our model captures individual patterns of brain activity that reflect semantic and visual systems. This opens the door to future personalised simulations of brain function. Code and pre-trained models will be made available at https://github.com/metrics-lab/sim; processed data for training will be available upon request at https://gin.g-node.org/Sdahan30/sim.
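To make the tri-modal contrastive alignment described in the abstract more concrete, here is a minimal NumPy sketch of a CLIP-style objective over fMRI, video, and audio embeddings. It is an illustration under stated assumptions, not the authors' implementation: the function names, the temperature of 0.07, the choice to sum three pairwise symmetric InfoNCE terms, and the synthetic embeddings are all hypothetical.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale each embedding to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def log_softmax(logits, axis=-1):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def clip_pair_loss(x, y, temperature=0.07):
    """Symmetric InfoNCE loss between two batches of paired embeddings.

    Row i of x and row i of y are assumed to come from the same movie clip.
    The temperature value is an assumption, not taken from the paper.
    """
    x, y = l2_normalize(x), l2_normalize(y)
    logits = x @ y.T / temperature            # (B, B) scaled cosine similarities
    idx = np.arange(len(x))                   # matching pairs lie on the diagonal
    loss_xy = -log_softmax(logits, axis=1)[idx, idx].mean()  # x retrieves y
    loss_yx = -log_softmax(logits, axis=0)[idx, idx].mean()  # y retrieves x
    return 0.5 * (loss_xy + loss_yx)

def trimodal_clip_loss(f, v, a, temperature=0.07):
    """Sum of pairwise losses over (fMRI, video), (fMRI, audio), (video, audio).

    Summing the three pairs with equal weight is an assumption of this sketch.
    """
    return (clip_pair_loss(f, v, temperature)
            + clip_pair_loss(f, a, temperature)
            + clip_pair_loss(v, a, temperature))

# Synthetic embeddings standing in for the outputs of the multimodal mappers.
rng = np.random.default_rng(0)
batch, dim = 8, 16
f = rng.normal(size=(batch, dim))
v_aligned = f + 0.01 * rng.normal(size=(batch, dim))
a_aligned = f + 0.01 * rng.normal(size=(batch, dim))
v_random = rng.normal(size=(batch, dim))
a_random = rng.normal(size=(batch, dim))

loss_aligned = trimodal_clip_loss(f, v_aligned, a_aligned)
loss_random = trimodal_clip_loss(f, v_random, a_random)
print(f"aligned: {loss_aligned:.3f}  random: {loss_random:.3f}")
```

In this toy setting the loss is lower when the three modalities share a common structure than when they are unrelated, which is the property that contrastive training exploits to place matching fMRI, video, and audio clips close together.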

Paper Structure

This paper contains 43 sections, 20 figures, 8 tables.

Figures (20)

  • Figure 1: SIM seeks to align visual and audio stimuli, extracted from $T=3$ second movie clips, with brain activations acquired over the same time period: [A] 7T fMRI data, collected during movie watching, is first projected to subjects' native cortical surfaces (resolution $V$=59292 vertices), then inflated to a sphere and downsampled onto a regularly tessellated icosphere ($I_6$ with $V$=40962 vertices). [B] This is then encoded using a surface vision transformer (SiT), which tokenizes data by patching each $I_6$ sphere with a lower-resolution icospheric grid to generate a sequence of triangular patches ($V$=45 vertices per patch; $N$=1280 patches). [C] The SiT encoder ($\Phi^{fMRI}_{enc}$) is pre-trained as part of a video surface masked autoencoder (vsMAE) for fMRI-frame reconstruction; [D] fMRI embeddings ($f_i$) are then aligned through CLIP contrastive training with video ($v_i$) [E] and audio ($a_i$) [F] embeddings learnt from a videoMAE ($\Phi^{\mathcal{V}}_{enc}$) and a wav2vec ($\Phi^{\mathcal{A}}_{enc}$) model, and projected to vector spaces of common length by multimodal mappers ($f^{fMRI}_\theta,f^{\mathcal{V}}_\theta,f^{\mathcal{A}}_\theta$). At test time this makes it possible to decode video/audio stimuli from fMRI (or vice versa) by comparing the similarity of their CLIP embeddings ($f_i,a_i,v_i$). This comparison generates a probability distribution over the candidate samples, allowing retrieval performance to be evaluated with top-K accuracy metrics.
  • Figure 2: (a) Each movie-watching session (MOVIE1-4) is composed of movie scenes extracted from different movies (ranging in total from $1.4$ to $4.3$ min) and interleaved with $20$ s rest intervals. Movie scenes are further divided into $3$ s movie clips for training and inference. (b) Overview of the three experimental setups: (i) Experiment 1: subjects are divided into training and testing groups; training involves all movie clips, while testing validates whether we can decode movie clips that were seen during training from the brain activations of new subjects in the test set. (ii) Experiment 2: all subjects are used for both training and testing, but only the first half of each movie is used for training; we then validate whether we can decode new clips from movie scenes that were not seen during training. (iii) Experiment 3: training is limited to a subset of subjects (as in (i)) and to the first half of each movie (as in (ii)); the model is then tested on decoding new movie clips, taken from the last half of all movies, from the brain activations of new subjects not seen during training. Figure \ref{figure:expdesign} further clarifies the multilevel sampling terminology.
  • Figure 3: Video retrieval for Experiment 3 (generalisation to new movie scenes and new subjects) from: (a) soft-negative sampling: here, the reference movie clip is correctly retrieved as top-1 and the top-ranked movie clips all depict human faces; (b) hard-negative sampling: here, the top-ranked movie clips all correspond to dialogue scenes.
  • Figure 4: Average attention maps for each attention head, extracted from the SiT encoder ($\Phi^{fMRI}_{enc}$), for a 3 s movie clip of a dialogue scene from Ocean's Eleven. Maps are averaged across test subjects (in Experiment 1). Brain regions of importance for the movie-watching stimuli are highlighted and annotated based on the HCP multimodal parcellation (Glasser et al., 2016). Scene files will be made available on BALSA. Comparison with functional networks in Appendix \ref{appendix:attention_networks}.
  • Figure 5: Soft-negative $\mathrm{fMRI} \rightarrow \mathcal{V}$ retrieval results for Experiments 2 and 3. Negative pairs (31) are sampled from different movies than the positive sample, but all are sampled from new movie scenes (not used during training), showing generalisation to new stimuli for train subjects (Experiment 2) and new subjects (Experiment 3), as detailed in Figure \ref{figure:movie_clips}. Results (in %) are reported as mean ($\bar{\mu}$) with 95% confidence interval. Two-sample t-tests (Ridge vs. SiT) with Bonferroni correction were highly significant ($p< 0.001$).
  • ...and 15 more figures
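
The retrieval evaluation outlined in the captions for Figures 1 and 5 (a probability distribution over candidate clips, scored with top-K accuracy against 31 negatives) can be sketched as follows. This is a hedged illustration rather than the authors' evaluation code: the function names, the temperature, the values of K, and the synthetic embeddings are assumptions, and the fMRI queries are simulated as noisy copies of the positive video embedding.

```python
import numpy as np

def retrieval_probabilities(query, candidates, temperature=0.07):
    """Softmax over cosine similarities between one query and all candidates."""
    q = query / np.linalg.norm(query)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    logits = c @ q / temperature
    logits = logits - logits.max()            # stabilise the exponentials
    probs = np.exp(logits)
    return probs / probs.sum()

def top_k_hit(probs, positive_index, k):
    """True if the positive candidate is among the k most probable ones."""
    top_k = np.argsort(probs)[::-1][:k]
    return bool(positive_index in top_k)

def top_k_accuracy(queries, candidate_sets, k, positive_index=0):
    """Fraction of queries whose positive candidate is ranked in the top k."""
    hits = [top_k_hit(retrieval_probabilities(q, c), positive_index, k)
            for q, c in zip(queries, candidate_sets)]
    return float(np.mean(hits))

# Synthetic setup: each trial has 1 positive + 31 negatives = 32 candidates.
rng = np.random.default_rng(0)
n_trials, n_candidates, dim = 200, 32, 64
candidate_sets = rng.normal(size=(n_trials, n_candidates, dim))

# Hypothetical fMRI embeddings: noisy copies of the positive video (index 0).
queries = candidate_sets[:, 0, :] + 0.5 * rng.normal(size=(n_trials, dim))

top1 = top_k_accuracy(queries, candidate_sets, k=1)
top10 = top_k_accuracy(queries, candidate_sets, k=10)
chance_top1 = 1.0 / n_candidates              # 1/32 = 3.125% for random guessing
print(f"top-1: {top1:.3f}  top-10: {top10:.3f}  chance top-1: {chance_top1:.5f}")
```

With one positive and 31 negatives, random guessing gives a top-1 accuracy of 1/32 (about 3.1%), which is a useful reference point when reading the reported retrieval percentages.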