Explanations of Large Language Models Explain Language Representations in the Brain

Maryam Rahimi, Yadollah Yaghoobzadeh, Mohammad Reza Daliri

TL;DR

This work asks how language processing in the brain relates to large language models (LLMs). It uses explainable AI (XAI) attribution methods to quantify how preceding words influence next-word predictions, then uses those attributions to predict fMRI responses recorded while participants listened to naturalistic stories. Using four attribution methods across GPT-2, Llama 2, and Phi-2, the authors build attribution- and layer-conductance-based feature spaces and show that these explanations robustly predict brain activity across the language network, with early LM layers mapping to early brain processing and later layers to higher-level processing. Compared to internal representations such as activations and attention, attribution-based explanations can achieve strong brain alignment with far fewer features. They also reveal a hierarchical correspondence between LM processing and neural language stages: the share of voxels best predicted by each layer correlates strongly with the share of words most influenced by that layer ($r = 0.97$). The findings advance a neuroscience-informed framework for evaluating XAI explanations against brain data and suggest that attribution-based explanations offer a complementary lens on language processing that can extend to other cognitive domains and multimodal models. Overall, the study argues for the neural plausibility of explanatory signals and positions brain alignment as a criterion for evaluating AI explanations.

Abstract

Large language models (LLMs) not only exhibit human-like performance but also share computational principles with the brain's language processing mechanisms. While prior research has focused on mapping LLMs' internal representations to neural activity, we propose a novel approach using explainable AI (XAI) to strengthen this link. Applying attribution methods, we quantify the influence of preceding words on LLMs' next-word predictions and use these explanations to predict fMRI data from participants listening to narratives. We find that attribution methods robustly predict brain activity across the language network, revealing a hierarchical pattern: explanations from early layers align with the brain's initial language processing stages, while later layers correspond to more advanced stages. Additionally, layers with greater influence on next-word prediction (reflected in higher attribution scores) demonstrate stronger brain alignment. These results underscore XAI's potential for exploring the neural basis of language and suggest brain alignment for assessing the biological plausibility of explanation methods.
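
The encoding step named in the abstract (predicting fMRI responses from explanation features and scoring the fit on held-out data) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the data are synthetic, the ridge penalty `alpha` is fixed rather than tuned, and hemodynamic delays are ignored. It is not the authors' pipeline, only the general shape of a ridge-regression encoder scored by a per-voxel Pearson correlation.

```python
import numpy as np

def fit_ridge(X, Y, alpha=1.0):
    """Closed-form ridge regression: W = (X^T X + alpha * I)^-1 X^T Y."""
    n_features = X.shape[1]
    gram = X.T @ X + alpha * np.eye(n_features)
    return np.linalg.solve(gram, X.T @ Y)

def brain_score(Y_true, Y_pred):
    """Pearson correlation per voxel (column) between actual and predicted responses."""
    yt = Y_true - Y_true.mean(axis=0)
    yp = Y_pred - Y_pred.mean(axis=0)
    denom = np.linalg.norm(yt, axis=0) * np.linalg.norm(yp, axis=0)
    return (yt * yp).sum(axis=0) / denom

# Synthetic stand-ins (assumed): rows are time points, columns are features or voxels.
rng = np.random.default_rng(0)
n_train, n_test, n_features, n_voxels = 200, 100, 10, 5
X = rng.standard_normal((n_train + n_test, n_features))
W_true = rng.standard_normal((n_features, n_voxels))
W_true[:, -1] = 0.0  # the last voxel carries no feature-driven signal
Y = X @ W_true + 0.5 * rng.standard_normal((n_train + n_test, n_voxels))

# Fit on the training portion, score on the held-out portion.
X_train, X_test = X[:n_train], X[n_train:]
Y_train, Y_test = Y[:n_train], Y[n_train:]
W = fit_ridge(X_train, Y_train, alpha=10.0)
scores = brain_score(Y_test, X_test @ W)
print(np.round(scores, 2))
```

In this toy setup the voxels driven by the features score well above the noise-only voxel, which is the contrast a brain score is meant to expose.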

Paper Structure

This paper contains 18 sections, 8 equations, 11 figures, 1 table.

Figures (11)

  • Figure 1: Approach. a, Language processing in the brain and LLMs shows striking parallels. Applying XAI methods, this study investigates how XAI can reveal insights into both systems and their interrelationship. b, To test this hypothesis, we used attribution methods as representatives of XAI. Attribution methods reveal how much each previous word in a context influences the model’s decision about the next incoming word. Using a sliding window approach, we applied these methods to continuous stories. Since the sliding windows overlap, each word in the story is processed within different windows, resulting in multiple importance scores being assigned to a single word. These scores were stored in a vector for each word, forming an attribution feature space. c, We used ridge regression to quantify how well attribution-based explanations predict fMRI brain activity recorded while participants listened to stories. Prediction accuracy was assessed with a brain score, defined as the Pearson correlation between the predicted and actual brain responses on held-out data. We applied this approach to attribution feature spaces generated by four distinct attribution methods (Erasure, Integrated Gradients, Gradient Norm, and Gradient $\times$ Input) across three LLMs (GPT-2, Llama 2, and Phi-2), using four stories. Each feature space was independently tested to predict the fMRI activity of participants for its corresponding story. In total, 147 participants were analyzed across all stories. Additionally, we compared these attribution-based feature spaces to widely used internal LLM representations, namely activations and attention weights, using the same stories, LLMs, and participants. In the figure: $x$ represents the input sequence, $x_t$ denotes the target (incoming) word, and $x'$ is the baseline input. $f$ represents the LLM function, $y$ is the recorded brain response (BOLD signal) from a voxel, and $\hat{y}$ is the predicted response. Corr. refers to the Pearson correlation used for evaluation. The formulas for the attribution methods are given in full in the Methods section.
  • Figure 2: Brain score of attribution methods across LLMs. a, Brain score obtained with the Gradient Norm and Gradient $\times$ Input attribution methods for Llama 2. Scores were computed voxel-wise for each participant and attribution method, then averaged across all participants and methods. Only significantly predicted voxels are color-coded. b, Voxel-wise brain scores were computed for each combination of attribution method, LLM, and individual, then averaged across individuals and voxels within the left hemisphere. Similar patterns were observed in the right hemisphere (Fig. S1a). c, Brain scores of LLMs across ROIs. Alignment is expressed as brain score, normalized and presented as a percentage of the noise ceiling, which was estimated using intersubject correlation (ISC) [Nastase_isc]. (The noise ceiling for each ROI is shown in Fig. \ref{fig:S_noise_ceiling}.) Markers represent the mean brain score (as % of noise ceiling) for each ROI. Brain scores were computed independently for each combination of feature space (Gradient Norm and Gradient $\times$ Input), LLM, and individual, and then averaged across feature spaces and participants. Error bars indicate the 95% confidence interval across individuals. Marker colors match the brain map, indicating the corresponding brain region, while error bar colors represent different LLMs. Results for right-hemisphere language ROIs are shown in Fig. S1b. d, The model with the highest brain score for the Gradient Norm and Gradient $\times$ Input explanations. Only voxels where the differences among brain scores are statistically significant are color-coded. Significant voxels in both a and d were identified using the Wilcoxon signed-rank test across individuals. For d, a two-step procedure was applied: first, a Friedman test for differences among the models; second, a Wilcoxon signed-rank test to compare models. All $P$ values were corrected for multiple comparisons using the false discovery rate (FDR), with a significance threshold of $P < 0.05$.
  • Figure 3: Comparing Brain Scores from Attribution and Internal Representations. a, Voxels predicted significantly by attribution only (red), by internal representations only (blue), or by both (light purple). The encoder was fitted on each pair of feature space and LLM independently. To identify significantly predicted voxels, we concatenated the brain scores of corresponding voxels from the same feature spaces regardless of the examined LLM, and applied the Wilcoxon signed-rank test. The significance threshold was set at $P < 0.05$, corrected for multiple comparisons using FDR. b, Brain scores of feature spaces across ROIs. Alignment is expressed as brain score, normalized and presented as a percentage of the noise ceiling. Markers represent the mean brain score (as % of noise ceiling) for each ROI. Brain scores were computed independently for each combination of feature space, LLM, and individual, then averaged across LLMs and participants. Error bars indicate the 95% confidence interval across individuals. Marker colors match the brain map, indicating the corresponding brain region, while error bar colors represent different feature spaces. Results for right-hemisphere language ROIs are shown in Fig. \ref{fig:s_feature_space_comparison_R}.
  • Figure 4: Hierarchical alignment between GPT-2 layer conductance and brain activity. a, Layer preference per voxel based on conductance scores. The encoder was fitted on the conductance of each GPT-2 layer independently, and significant voxels ($P < 0.01$, Wilcoxon signed-rank test with FDR correction) are color-coded by the layer whose conductance provided the best prediction accuracy. b, Distribution of layer preference per voxel within brain regions involved in language processing. The percentage of voxels within language-related brain regions (Heschl's gyrus and Heschl's sulcus, superior temporal gyrus, superior temporal sulcus, inferior frontal gyrus, and angular gyrus) is plotted for each layer’s conductance, showing a hierarchical organization that aligns model layers with language-processing stages. Early model layers predominantly predict auditory regions, while higher layers align with regions supporting complex language functions. c, Relationship between layer-wise prediction performance and brain alignment. The percentage of brain voxels aligned with each layer (blue) was compared with the percentage of words most influenced by each layer (orange), based on conductance scores. The correlation (Pearson’s $r = 0.97$, $P = 2.2 \times 10^{-7}$) demonstrates a strong link between the importance of model layers in language representation and their predictive relevance for brain activity. d, Distribution of layer importance for words clustered by part of speech (POS). Words from different stories are aggregated and grouped by POS categories (e.g., nouns, verbs, adjectives).
  • Figure S1: Brain score across attribution methods and LLMs in the right hemisphere. a, Voxel-wise brain scores for each combination of attribution method, LLM, and individual, averaged across individuals and voxels within the right hemisphere. This panel is analogous to Fig. 2b but for the right hemisphere. b, Brain scores of LLMs across right-hemisphere ROIs. Alignment is expressed as brain score, normalized and presented as a percentage of the noise ceiling, estimated using intersubject correlation. This panel mirrors Fig. 2c. Markers represent the mean brain score (as % of noise ceiling) for each ROI. Error bars indicate the 95% confidence interval across individuals. Marker colors match the brain map, indicating the corresponding brain region, while error bar colors represent different LLMs.
  • ...and 6 more figures
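
The sliding-window construction described in the Figure 1 caption, where overlapping windows give each word several importance scores that are stored in one vector, can be sketched roughly as follows. This is a hedged illustration only: `toy_score` is a hypothetical stand-in for an LLM's next-word score (a dot product over random embeddings), the attribution used is a simple erasure difference, and ordering each word's vector by its distance to the target word is an assumption, not something the caption specifies.

```python
import numpy as np

def toy_score(context_ids, target_id, emb):
    """Hypothetical stand-in for an LLM's score of the target word given a context."""
    if len(context_ids) == 0:
        return 0.0
    return float(emb[context_ids].mean(axis=0) @ emb[target_id])

def erasure_attributions(context_ids, target_id, emb):
    """Erasure: drop in score when each context word is removed in turn."""
    full = toy_score(context_ids, target_id, emb)
    scores = []
    for i in range(len(context_ids)):
        reduced = np.delete(context_ids, i)
        scores.append(full - toy_score(reduced, target_id, emb))
    return np.array(scores)

def attribution_feature_space(token_ids, emb, window=4):
    """One row per word; column k holds the word's score for the target k+1 steps ahead."""
    n = len(token_ids)
    features = np.zeros((n, window))
    for t in range(1, n):                 # t indexes the target (incoming) word
        start = max(0, t - window)
        context = token_ids[start:t]      # up to `window` preceding words
        attr = erasure_attributions(context, token_ids[t], emb)
        for j, score in enumerate(attr):
            pos = start + j               # story position of this context word
            features[pos, t - pos - 1] = score
    return features

# Toy story (assumed): 12 token ids drawn from a 20-word vocabulary.
rng = np.random.default_rng(0)
emb = rng.standard_normal((20, 8))
token_ids = rng.integers(0, 20, size=12)
features = attribution_feature_space(token_ids, emb, window=4)
print(features.shape)
```

Each row of `features` could then serve as the per-word input to an encoding model; words near the end of the story have fewer later targets, so their trailing entries stay at zero in this sketch.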