Mind the Gap: Aligning the Brain with Language Models Requires a Nonlinear and Multimodal Approach

Danny Dongyeop Han, Yunju Cho, Jiook Cha, Jay-Yoon Lee

TL;DR

This work demonstrates that predicting brain responses to natural speech requires nonlinear, multimodal encoding that fuses auditory and semantic representations. By integrating audio features from Whisper with semantic features from LLAMA and applying nonlinear encoders, the authors achieve substantial gains in predicting fMRI activity across cortex, outperforming unimodal linear models by approximately 17% and surpassing prior state-of-the-art methods by 7–14%. They introduce a spatiotemporal clustering approach using Relative Error Difference to reveal more coherent functional organization under nonlinear multimodal models and provide evidence that joint audio-semantic processing dominates cortical representations. The findings support neurolinguistic theories of distributed, multimodal integration and highlight the importance of nonlinear interactions for brain-aligned AI, while also noting limitations related to dataset size and interpretability that motivate future work.

Abstract

Self-supervised language and audio models effectively predict brain responses to speech. However, traditional prediction models rely on linear mappings from unimodal features, despite the complex integration of auditory signals with linguistic and semantic information across widespread brain networks during speech comprehension. Here, we introduce a nonlinear, multimodal prediction model that combines audio and linguistic features from pre-trained models (e.g., LLAMA, Whisper). Our approach achieves a 17.2% and 17.9% improvement in prediction performance (unnormalized and normalized correlation) over traditional unimodal linear models, as well as a 7.7% and 14.4% improvement, respectively, over prior state-of-the-art models. These improvements represent a major step towards future robust in-silico testing and improved decoding performance. They also reveal how auditory and semantic information are fused in motor, somatosensory, and higher-level semantic regions, aligning with existing neurolinguistic theories. Overall, our work highlights the often neglected potential of nonlinear and multimodal approaches to brain modeling, paving the way for future studies to embrace these strategies in naturalistic neurolinguistics research.
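The reported gains are stated in terms of correlation between predicted and measured responses. As a minimal sketch of how such a comparison can be computed, the snippet below scores two sets of predictions with voxelwise Pearson correlation and reports the relative improvement of the mean. All names (`pearson`, `voxelwise_r`, `relative_improvement`) and the toy numbers are illustrative assumptions, not the authors' code or data. Taking the improvement as the relative change in mean correlation is also an assumption about how the percentages were derived. The encoders themselves, linear or nonlinear, are not reproduced here.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for a constant series; treated as 0 here
    return cov / sqrt(var_x * var_y)


def voxelwise_r(measured, predicted):
    """One correlation per voxel; each argument is a list of per-voxel time series."""
    return [pearson(m, p) for m, p in zip(measured, predicted)]


def mean(values):
    return sum(values) / len(values)


def relative_improvement(baseline, candidate):
    """Percentage change of candidate over baseline (assumed definition)."""
    return 100.0 * (candidate - baseline) / baseline


# Toy data (made up): two voxels, four time points each.
measured = [[1, 2, 3, 4], [2, 1, 4, 3]]
pred_baseline = [[1, 3, 2, 4], [1, 2, 3, 4]]    # stands in for a unimodal linear model
pred_candidate = [[1, 2, 3, 5], [2, 1, 3, 3]]   # stands in for a multimodal nonlinear model

r_baseline = voxelwise_r(measured, pred_baseline)
r_candidate = voxelwise_r(measured, pred_candidate)

# Voxelwise difference in correlation, analogous to a per-voxel delta-r map.
delta_r = [c - b for b, c in zip(r_baseline, r_candidate)]
gain = relative_improvement(mean(r_baseline), mean(r_candidate))

print("baseline r per voxel:", r_baseline)
print("candidate r per voxel:", r_candidate)
print("relative improvement of mean r (%):", round(gain, 1))
```

A normalized correlation would additionally divide each voxel's correlation by an estimate of its noise ceiling. That step is omitted because the source does not specify the estimator.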

Paper Structure

This paper contains 52 sections, 2 equations, 40 figures, 3 tables.

Figures (40)

  • Figure 1: Spatio-temporal clustering analysis: (a,b) Functional connectivity matrix and hierarchical clustering dendrogram from raw fMRI correlations. (c,d) Correlation matrices and dendrograms from Relative Error Difference (RED) between semantic and audio encoding models using MLP encoders. Matrix values indicate regional similarity. Hierarchical clustering reveals brain region organization by response profiles. The nonlinear models (d) show clearer functional groupings than standard connectivity (b), quantified by higher modularity scores (see main text).
  • Figure 2: Multimodality improvement in encoding models. Panels (a)-(d) display voxelwise $\Delta r$ values of a single subject, with warmer colors indicating regions where multimodal models outperform linear models. Each panel shows the difference between the voxelwise predictions of the model in the corresponding column and the model in the corresponding row. For example, panel (a) shows the difference between the Multimodal Linear and Semantic Linear models. (e) Box plot showing $\Delta r$ across different regions of interest (ROIs), where the $\Delta r$ values are aggregated over all subjects. "mult" and "sem" refer to multimodal and semantic encoders, respectively. Asterisks (*) indicate ROIs where $\Delta r > 0$ is statistically significant ($p < 0.05$). ROIs are grouped and color-coded by function. The boxes represent the range between the 25th and 75th percentiles, with the line inside showing the median. Whiskers extend to 1.5 times the interquartile range. A complete list of ROI abbreviations is given in the Appendix, which also contains voxelwise and ROI-wise plots for each subject.
  • Figure 3: Visualization of the most dominant feature type in brain activity predictions from variance partitioning analysis. (a) Voxel-wise plots from a single subject (S1) and (b) ROI-wise Venn diagrams showing which feature type (semantic: red, audio: green, joint: blue) explains the largest variance for each significantly predicted voxel ($q(\text{FDR})<0.01$) using MLP encoders. ROI results are aggregated across subjects, with numbers indicating voxel percentages and counts.
  • Figure 4: Heatmap showing average $r^2$ values for different combinations of LLAMA and Whisper layer depths using an MLP encoder. Darker colors represent higher performance, with the best results obtained when the best layers of the respective unimodal encoding models were used.
  • Figure 5: Encoder performance across different LLAMA and Whisper model variants, using linear regression applied to the full set of voxels. Panel (a) compares LLAMA models of various architectures (LLAMA-2 and LLAMA-3) with 7B and 8B parameters. Panel (b) presents performance across different LLAMA models of increasing sizes, from 7B to 65B. Panels (c) and (d) show the performance for different Whisper model variants, including comparisons between Whisper Large versions (c) and different model sizes (d), from Whisper Tiny to Whisper Large. Performance is measured in terms of average $r^2$, plotted against normalized layer depth.
  • ...and 35 more figures
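
The variance partitioning behind Figure 3 can be illustrated with the standard two-feature-set decomposition, sketched below under stated assumptions. The function names and the example $r^2$ values are hypothetical. The sketch also assumes that the "joint" component in the caption is the variance shared between the semantic and audio models. The source does not confirm that the authors used exactly this formulation.

```python
def partition_variance(r2_semantic, r2_audio, r2_multimodal):
    """Split explained variance into unique and shared parts (assumed formulation).

    r2_semantic   -- variance explained by the semantic-only model
    r2_audio      -- variance explained by the audio-only model
    r2_multimodal -- variance explained by the model using both feature sets
    """
    return {
        # What the semantic features add beyond the audio-only model.
        "semantic": r2_multimodal - r2_audio,
        # What the audio features add beyond the semantic-only model.
        "audio": r2_multimodal - r2_semantic,
        # Variance that either feature set can explain on its own.
        "joint": r2_semantic + r2_audio - r2_multimodal,
    }


def dominant_feature(parts):
    """Label of the component that explains the most variance."""
    return max(parts, key=parts.get)


# Hypothetical r^2 values for two voxels (not taken from the paper).
voxel_a = partition_variance(r2_semantic=0.10, r2_audio=0.08, r2_multimodal=0.12)
voxel_b = partition_variance(r2_semantic=0.10, r2_audio=0.02, r2_multimodal=0.11)

print(dominant_feature(voxel_a))
print(dominant_feature(voxel_b))
```

By construction the three components sum to the multimodal $r^2$, so each voxel can be labeled by whichever component is largest, which is the kind of labeling the caption describes.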