Explanations of Large Language Models Explain Language Representations in the Brain

Maryam Rahimi; Yadollah Yaghoobzadeh; Mohammad Reza Daliri

TL;DR

This work examines how language processing in the brain relates to large language models. It uses explainable AI attribution methods to quantify how preceding words influence next-word predictions, then uses those attributions to predict fMRI responses recorded while participants listened to naturalistic stories. Using four attribution methods across GPT-2, Llama 2, and Phi-2, the authors build attribution-based and layer-conductance-based feature spaces and show that these explanations robustly predict brain activity across the language network, with early LM layers mapping to early brain processing and later layers to higher-level processing. Compared to internal representations such as activations and attention, attribution-based explanations can achieve strong brain alignment with far fewer features. They also reveal a hierarchical correspondence between LM processing and neural language stages, including a voxel-level layer preference correlation of $r = 0.97$. The findings advance a neuroscience-informed framework for evaluating XAI explanations against brain data and suggest that attribution-based explanations offer a complementary lens on language processing, one that could extend to other cognitive domains and multimodal models. Overall, the study supports the neural plausibility of explanatory signals and positions brain alignment as an objective for evaluating AI explanations.

Abstract

Large language models (LLMs) not only exhibit human-like performance but also share computational principles with the brain's language processing mechanisms. While prior research has focused on mapping LLMs' internal representations to neural activity, we propose a novel approach using explainable AI (XAI) to strengthen this link. Applying attribution methods, we quantify the influence of preceding words on LLMs' next-word predictions and use these explanations to predict fMRI data from participants listening to narratives. We find that attribution methods robustly predict brain activity across the language network, revealing a hierarchical pattern: explanations from early layers align with the brain's initial language processing stages, while later layers correspond to more advanced stages. Additionally, layers with greater influence on next-word prediction (reflected in higher attribution scores) demonstrate stronger brain alignment. These results underscore XAI's potential for exploring the neural basis of language and suggest brain alignment as a means of assessing the biological plausibility of explanation methods.
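To make "quantifying the influence of preceding words on next-word predictions" concrete, the sketch below computes gradient-times-input attributions for a toy next-word predictor. Everything in it is an illustrative assumption: the model is a mean-pooled linear softmax rather than GPT-2, Llama 2, or Phi-2, gradient-times-input is one common attribution method and not necessarily one of the four used by the authors, and the names `log_prob_next` and `grad_x_input` are hypothetical.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def log_prob_next(E, W, y):
    """Log-probability of next word y given context embeddings E (T x d).

    Toy model (an assumption, not the paper's): mean-pool the context
    embeddings, then apply a linear softmax readout W (V x d).
    """
    h = E.mean(axis=0)
    return np.log(softmax(W @ h)[y])

def grad_x_input(E, W, y):
    """Gradient-times-input attribution of each context word for log p(y).

    Returns one score per preceding word: the dot product of that word's
    embedding with the gradient of log p(y) with respect to it.
    """
    T = E.shape[0]
    p = softmax(W @ E.mean(axis=0))
    onehot = np.zeros_like(p)
    onehot[y] = 1.0
    grad_h = W.T @ (onehot - p)            # d log p(y) / d h
    grad_E = np.tile(grad_h / T, (T, 1))   # mean pooling shares the gradient
    return (grad_E * E).sum(axis=1)

# Hypothetical toy data: 5 context words, 8-dim embeddings, 20-word vocabulary.
rng = np.random.default_rng(0)
T, d, V = 5, 8, 20
E = rng.normal(size=(T, d))
W = rng.normal(size=(V, d))
y = 3  # index of the word whose prediction is being explained

attr = grad_x_input(E, W, y)  # one attribution score per preceding word
print(attr)
```

In a setup like the one the abstract describes, per-word scores of this kind would form the feature space that is then used to predict fMRI responses; the regression step is not shown here because the source does not specify it.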
