Multi-modal brain encoding models for multi-modal stimuli

Subba Reddy Oota, Khushbu Pahwa, Mounika Marreddy, Maneesh Singh, Manish Gupta, Bapi S. Raju

TL;DR

The paper addresses how Transformer-based multi-modal representations align with human brain activity when subjects experience truly multi-modal stimuli (video with audio). It compares cross-modal (ImageBind) and jointly pretrained (TVLT) multi-modal models against unimodal baselines using the Movie10 fMRI dataset, and employs residual analysis to dissect modality contributions. The results show that both multi-modal approaches improve alignment in language and visual regions, with cross-modal models relying more on video features and jointly pretrained models integrating both video and audio information. These findings advance understanding of how the brain processes multi-modal information and highlight the potential for interpreting multi-modal models in neuroscience contexts.

Abstract

Recent work has demonstrated that multi-modal Transformer models can predict visual brain activity impressively well, even with incongruent modality representations, although participants in those studies engaged only with unimodal stimuli such as images or silent videos. This raises the question of how accurately these multi-modal models can predict brain activity when participants are engaged in multi-modal stimuli. As these models grow increasingly popular, their use in studying neural activity provides insights into how our brains respond to such multi-modal naturalistic stimuli, i.e., where the brain separates and integrates information across modalities through a hierarchy from early sensory regions to higher cognition. We investigate this question by using multiple unimodal models and two types of multi-modal models (cross-modal and jointly pretrained) to determine which type of model is more relevant to fMRI brain activity when participants are engaged in watching movies. We observe that both types of multi-modal models show improved alignment in several language and visual regions. This study also helps in identifying which brain regions process unimodal versus multi-modal information. We further investigate the contribution of each modality to multi-modal alignment by carefully removing unimodal features one by one from multi-modal representations, and find that there is additional information beyond the unimodal embeddings that is processed in the visual and language regions. Based on this investigation, we find that for cross-modal models, brain alignment is partially attributed to the video modality, whereas for jointly pretrained models, it is partially attributed to both the video and audio modalities. This serves as a strong motivation for the neuroscience community to investigate the interpretability of these models for deepening our understanding of multi-modal information processing in the brain.

Paper Structure

This paper contains 28 sections, 15 figures, 1 table.

Figures (15)

  • Figure 1: (A) Overview of our proposed Multi-modal Brain Encoding Pipeline. Using fMRI recordings from participants watching popular movies that include speech, we align stimulus representations with brain recordings through ridge regression. For unimodal alignment, we use representations from video models (VM) or speech models (SM), where the input consists exclusively of either videos (without speech) or speech, respectively. For multi-modal alignment, we leverage representations from cross-modal (CM) and jointly pretrained models (JM), where the input consists of both video and speech. Here, $f_1$, $f_2$, $g$ and $h$ are ridge regression models. (B) Residual Analysis. In step 1, we remove the unimodal video model (VM) representations from the cross-modal (CM) representations by learning a simple linear function $r$ that maps VM representations to the CM representations, and use this estimated function to obtain the residual representations |CM(X)-$r$(VM(X))|. In step 2, we learn another ridge regression model ($g'$) to measure the brain alignment between the residual representations |CM(X)-$r$(VM(X))| and the fMRI brain recordings. Similarly, residual analysis can also be applied to remove unimodal speech (SM) features from CM or JM representations for a given input X.
  • Figure 2: (Left) Average normalized brain alignment of pretrained vs. randomly initialized multi-modal and unimodal models across the whole brain. $\times$ indicates that pretrained model embeddings are significantly better than randomly initialized models, i.e., $p \leq 0.05$. (Right) Average normalized brain alignment for both multi-modal and unimodal model features specifically within language and visual regions. The blue bar represents the normalized alignment using randomly generated vector embeddings. Error bars indicate the standard error of the mean across participants. $\ast$ indicates that multi-modal embeddings are significantly better than unimodal video models (VM), i.e., $p \leq 0.05$. $\wedge$ indicates that multi-modal embeddings are significantly better than unimodal speech models (SM), i.e., $p \leq 0.05$.
  • Figure 3: Average normalized brain alignment for video and audio modalities from multi-modal and individual modality features across the whole brain and several ROIs of language (AG, PTL and IFG), visual (EVC, PPA and MT) and auditory cortex (AC). Error bars indicate the standard error of the mean across participants. $\ast$ indicates cases where multi-modal embeddings are significantly better than unimodal video models (VM), i.e., $p \leq 0.05$. $\wedge$ indicates cases where multi-modal embeddings are significantly better than unimodal speech models (SM), i.e., $p \leq 0.05$.
  • Figure 4: Residual analysis: Average normalized brain alignment was computed across participants before and after removal of video and audio embeddings from both jointly pretrained and cross-modal models. Error bars indicate the standard error of the mean across participants. The "-" symbol denotes residuals.
  • Figure 5: Percent decrease in brain alignment after removal of unimodal embeddings from different multi-modal models. (Left) Removal of unimodal VM embeddings from IB-Concat. (Middle) Removal of unimodal VM embeddings from jointly pretrained TVLT. (Right) Removal of unimodal SM embeddings from TVLT Joint. The color bar indicates the percent decrease: darker shades of red denote larger decreases and white denotes zero. LH: Left Hemisphere and RH: Right Hemisphere.
  • ...and 10 more figures
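
The two steps in the Figure 1 caption (ridge-regression encoding, then residual analysis) can be illustrated with a minimal sketch on synthetic data. This is not the authors' implementation. The function names, the regularization strength `alpha`, the feature and voxel dimensions, the train/test split, and the use of per-voxel Pearson correlation as the alignment score are all assumptions made for illustration. The sketch also uses the signed residual CM(X) - r(VM(X)) and assumes roughly mean-centered inputs so that no intercept is fitted.

```python
import numpy as np


def ridge_fit(X, Y, alpha=1.0):
    """Closed-form ridge weights W minimizing ||X W - Y||^2 + alpha * ||W||^2."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ Y)


def pearson_per_column(A, B):
    """Pearson correlation between matching columns of A and B."""
    A = A - A.mean(axis=0)
    B = B - B.mean(axis=0)
    denom = np.linalg.norm(A, axis=0) * np.linalg.norm(B, axis=0)
    return (A * B).sum(axis=0) / denom


def brain_alignment(feat_tr, fmri_tr, feat_te, fmri_te, alpha=1.0):
    """Fit a ridge encoder on training data; score held-out data per voxel."""
    W = ridge_fit(feat_tr, fmri_tr, alpha)
    return pearson_per_column(feat_te @ W, fmri_te)


def residual_features(uni_tr, multi_tr, uni_te, multi_te, alpha=1.0):
    """Step 1: learn a linear map r from unimodal to multi-modal features on
    the training split and subtract its prediction from both splits."""
    R = ridge_fit(uni_tr, multi_tr, alpha)
    return multi_tr - uni_tr @ R, multi_te - uni_te @ R


# Synthetic stand-ins (assumed sizes): 300 time points, 8 video-model (VM)
# features, 4 extra features present only in the multi-modal representation,
# and 20 voxels driven by both feature groups plus noise.
rng = np.random.default_rng(0)
n, n_train = 300, 200
vm = rng.normal(size=(n, 8))
extra = rng.normal(size=(n, 4))
cm = np.hstack([vm @ rng.normal(size=(8, 8)), extra])  # "cross-modal" features
fmri = (vm @ rng.normal(size=(8, 20))
        + extra @ rng.normal(size=(4, 20))
        + rng.normal(size=(n, 20)))

tr, te = slice(0, n_train), slice(n_train, n)

# Alignment of the full multi-modal features with the simulated recordings.
full = brain_alignment(cm[tr], fmri[tr], cm[te], fmri[te])

# Step 2: alignment of the residual features after removing VM information.
res_tr, res_te = residual_features(vm[tr], cm[tr], vm[te], cm[te])
resid = brain_alignment(res_tr, fmri[tr], res_te, fmri[te])

print("mean alignment, full multi-modal features:", full.mean())
print("mean alignment, after removing VM features:", resid.mean())
```

In this toy setup the residual alignment drops but stays above zero, because the multi-modal features carry information that is not linearly predictable from the VM features. That is the pattern the residual analysis is designed to detect.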