Explanations of Large Language Models Explain Language Representations in the Brain
Maryam Rahimi, Yadollah Yaghoobzadeh, Mohammad Reza Daliri
TL;DR
This work tackles how language processing in the brain relates to large language models by adopting explainable AI attribution methods to quantify how preceding words influence next-word predictions and by predicting fMRI responses during naturalistic storytelling. Using four attribution methods across GPT-2, Llama 2, and Phi-2, the authors build attribution- and layer-conductance-based feature spaces and show these explanations robustly predict brain activity across the language network, with early LM layers mapping to early brain processing and later layers to higher-level processing. Compared to internal representations like activations and attention, attribution-based explanations can achieve strong brain alignment with far fewer features and reveal a hierarchical correspondence between LM processing and neural language stages, including a remarkable voxel-level layer preference correlation ($r = 0.97$). The findings advance a neuroscience-informed framework for evaluating XAI explanations via brain data and suggest that attribution-based explanations offer a complementary lens on language processing that can extend to other cognitive domains and multimodal models. Overall, the study demonstrates the neural plausibility of explanatory signals and positions brain alignment as a rigorous objective for interpreting AI explanations.
Abstract
Large language models (LLMs) not only exhibit human-like performance but also share computational principles with the brain's language processing mechanisms. While prior research has focused on mapping LLMs' internal representations to neural activity, we propose a novel approach using explainable AI (XAI) to strengthen this link. Applying attribution methods, we quantify the influence of preceding words on LLMs' next-word predictions and use these explanations to predict fMRI data from participants listening to narratives. We find that attribution methods robustly predict brain activity across the language network, revealing a hierarchical pattern: explanations from early layers align with the brain's initial language processing stages, while later layers correspond to more advanced stages. Additionally, layers with greater influence on next-word prediction, as reflected in higher attribution scores, demonstrate stronger brain alignment. These results underscore XAI's potential for exploring the neural basis of language and suggest brain alignment as a criterion for assessing the biological plausibility of explanation methods.
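The step of using explanations to predict fMRI data can be illustrated with a minimal encoding-model sketch. It rests on assumptions not stated in this section: the feature matrix stands in for per-timepoint attribution scores, the mapping to voxel responses is ridge regression (a common choice in encoding-model work, not confirmed here as the authors' method), and brain alignment is scored as the per-voxel Pearson correlation on held-out data. All data are synthetic, the helper names `fit_ridge` and `voxelwise_pearson` are hypothetical, and practical details such as hemodynamic delay and cross-validated regularization are omitted.

```python
import numpy as np


def fit_ridge(X, Y, alpha=1.0):
    """Closed-form ridge regression: W = (X^T X + alpha * I)^(-1) X^T Y."""
    n_features = X.shape[1]
    gram = X.T @ X + alpha * np.eye(n_features)
    return np.linalg.solve(gram, X.T @ Y)


def voxelwise_pearson(Y_true, Y_pred):
    """Pearson correlation between measured and predicted responses, per voxel (column)."""
    yt = Y_true - Y_true.mean(axis=0)
    yp = Y_pred - Y_pred.mean(axis=0)
    denom = np.linalg.norm(yt, axis=0) * np.linalg.norm(yp, axis=0)
    return (yt * yp).sum(axis=0) / denom


# Synthetic stand-ins (assumption): rows are time points, columns of X are
# attribution-derived features, columns of Y are voxel responses.
rng = np.random.default_rng(0)
n_train, n_test, n_features, n_voxels = 400, 100, 10, 5
X = rng.standard_normal((n_train + n_test, n_features))
W_true = rng.standard_normal((n_features, n_voxels))
Y = X @ W_true + 0.5 * rng.standard_normal((n_train + n_test, n_voxels))

# Fit on the training portion, then evaluate on held-out time points.
X_train, X_test = X[:n_train], X[n_train:]
Y_train, Y_test = Y[:n_train], Y[n_train:]
W = fit_ridge(X_train, Y_train, alpha=1.0)
scores = voxelwise_pearson(Y_test, X_test @ W)
print(scores.round(3))
```

Under this setup, comparing feature spaces (for example, attributions from different layers) would amount to repeating the fit with a different `X` and comparing the resulting per-voxel scores.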
