Mind the Gap: Aligning the Brain with Language Models Requires a Nonlinear and Multimodal Approach
Danny Dongyeop Han, Yunju Cho, Jiook Cha, Jay-Yoon Lee
TL;DR
This work demonstrates that predicting brain responses to natural speech requires nonlinear, multimodal encoding that fuses auditory and semantic representations. By integrating audio features from Whisper with semantic features from LLAMA and applying nonlinear encoders, the authors achieve substantial gains in predicting fMRI activity across the cortex, outperforming unimodal linear models by approximately 17% and surpassing prior state-of-the-art methods by 7–14%. They introduce a spatiotemporal clustering approach based on Relative Error Difference, which reveals more coherent functional organization under nonlinear multimodal models, and they provide evidence that joint audio-semantic processing dominates cortical representations. The findings support neurolinguistic theories of distributed, multimodal integration and highlight the importance of nonlinear interactions for brain-aligned AI. The authors also note limitations in dataset size and interpretability that motivate future work.
Abstract
Self-supervised language and audio models effectively predict brain responses to speech. However, traditional prediction models rely on linear mappings from unimodal features, despite the complex integration of auditory signals with linguistic and semantic information across widespread brain networks during speech comprehension. Here, we introduce a nonlinear, multimodal prediction model that combines audio and linguistic features from pre-trained models (e.g., LLAMA, Whisper). Our approach improves prediction performance by 17.2% and 17.9% (unnormalized and normalized correlation, respectively) over traditional unimodal linear models, and by 7.7% and 14.4%, respectively, over prior state-of-the-art models. These improvements represent a major step towards robust in-silico testing and improved decoding performance. They also reveal how auditory and semantic information are fused in motor, somatosensory, and higher-level semantic regions, aligning with existing neurolinguistic theories. Overall, our work highlights the often-neglected potential of nonlinear and multimodal approaches to brain modeling, paving the way for future studies to embrace these strategies in naturalistic neurolinguistics research.
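To make the comparison in the abstract concrete, the following is a minimal sketch of a voxelwise encoding pipeline that contrasts a unimodal linear model, a multimodal linear model, and a multimodal nonlinear model. It is not the authors' implementation. Everything in it is an assumption made for illustration: the feature matrices are random placeholders standing in for Whisper and LLAMA embeddings, the responses are synthetic, the nonlinear encoder is a fixed random ReLU layer followed by ridge regression (a simple stand-in for a trained nonlinear encoder), and the function names are hypothetical.

```python
import numpy as np

def ridge_fit(X, Y, alpha=1.0):
    """Closed-form ridge regression weights mapping features X to voxel responses Y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ Y)

def pearson_per_voxel(pred, true):
    """Pearson correlation between predicted and measured responses, one value per voxel."""
    p = pred - pred.mean(axis=0)
    t = true - true.mean(axis=0)
    denom = np.sqrt((p ** 2).sum(axis=0) * (t ** 2).sum(axis=0))
    return (p * t).sum(axis=0) / denom

def relative_improvement(new, base):
    """Percentage improvement of `new` over `base`."""
    return 100.0 * (new - base) / base

rng = np.random.default_rng(0)
n_train, n_test, d_audio, d_sem, n_vox = 400, 100, 8, 8, 5
n = n_train + n_test

# Assumption: random placeholders instead of real Whisper / LLAMA embeddings.
audio = rng.standard_normal((n, d_audio))
semantic = rng.standard_normal((n, d_sem))

# Assumption: synthetic "voxel" responses built from an audio-semantic interaction.
A = rng.standard_normal((d_audio, n_vox))
S = rng.standard_normal((d_sem, n_vox))
Y = np.tanh(audio @ A) * (semantic @ S) + 0.1 * rng.standard_normal((n, n_vox))

train, test = slice(0, n_train), slice(n_train, n)

def evaluate(features):
    """Fit ridge on the training split and score held-out correlation per voxel."""
    W = ridge_fit(features[train], Y[train])
    return pearson_per_voxel(features[test] @ W, Y[test])

# Unimodal linear baseline: semantic features only.
r_unimodal = evaluate(semantic)

# Multimodal linear model: concatenate audio and semantic features.
fused = np.concatenate([audio, semantic], axis=1)
r_linear = evaluate(fused)

# Multimodal nonlinear model: fixed random ReLU layer, then a ridge read-out.
H = rng.standard_normal((fused.shape[1], 256)) / np.sqrt(fused.shape[1])
b = rng.standard_normal(256)
hidden = np.maximum(fused @ H + b, 0.0)
r_nonlinear = evaluate(hidden)

print("unimodal linear   :", r_unimodal.mean())
print("multimodal linear :", r_linear.mean())
print("multimodal nonlin.:", r_nonlinear.mean())
```

The synthetic responses are deliberately built from a product of audio and semantic terms, which is the kind of cross-modal interaction a purely linear read-out of either modality cannot represent. With real data, the features would be time-aligned to the fMRI sampling rate, and the normalized correlation reported in the abstract would additionally require a per-voxel noise ceiling, which this sketch does not estimate.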
