Neural Correlates of Language Models Are Specific to Human Language

Iñigo Parra

TL;DR

The paper investigates whether correlations between transformer hidden states and fMRI responses during language tasks reflect language-specific representations or arise from generic statistical properties. Using dimensionality-reduction checks, centered kernel alignment (CKA), Gromov-Wasserstein distance, and non-linguistic control models, it shows that alignment depends on positional encoding and on training with human language: the correlations persist after PCA and disappear for models trained on non-linguistic data. Deeper, language-trained models yield stronger brain alignment, while attention weights play only a marginal role, suggesting that the core similarity resides in representational structure shaped by language experience. Together, these results support the biological plausibility of transformer-based language processing and clarify the conditions under which brain–LM correspondences appear, with implications for interpretability and cross-domain neuroscience.

Abstract

Previous work has shown correlations between the hidden states of large language models and fMRI brain responses on language tasks. These correlations have been taken as evidence of the representational similarity of these models and brain states. This study tests whether these previous results are robust to several possible concerns. Specifically, this study shows: (i) that the previous results are still found after dimensionality reduction, and thus are not attributable to the curse of dimensionality; (ii) that previous results are confirmed when using new measures of similarity; (iii) that correlations between brain representations and those from models are specific to models trained on human language; and (iv) that the results are dependent on the presence of positional encoding in the models. These results confirm and strengthen the results of previous research and contribute to the debate on the biological plausibility and interpretability of state-of-the-art large language models.
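
One of the similarity measures named on this page is centered kernel alignment (CKA). As a rough illustration of what such a measure computes, the sketch below implements the standard linear form of CKA with NumPy. It is an assumption that the linear variant is the relevant one; this section does not say which CKA variant or preprocessing the paper uses. The function name `linear_cka` and the synthetic arrays are hypothetical and stand in for model hidden states and fMRI responses recorded for the same stimuli.

```python
import numpy as np


def linear_cka(X, Y):
    """Linear CKA between X (n, p) and Y (n, q).

    Rows of X and Y must correspond to the same n stimuli.
    The value lies in [0, 1]; higher means more similar geometry.
    """
    # Center each feature (column) so the score ignores mean offsets.
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    # Squared Frobenius norm of the cross-covariance, normalized by
    # the Frobenius norms of the two self-covariances.
    cross = np.linalg.norm(Y.T @ X, ord="fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, ord="fro")
    norm_y = np.linalg.norm(Y.T @ Y, ord="fro")
    return cross / (norm_x * norm_y)


# Synthetic stand-ins (assumption): 200 stimuli, 64 model features, 30 voxels.
rng = np.random.default_rng(0)
model_states = rng.standard_normal((200, 64))
brain_responses = rng.standard_normal((200, 30))
score = linear_cka(model_states, brain_responses)
```

Linear CKA is invariant to rotations and uniform rescaling of either feature space, which is one reason it is often used to compare representations whose dimensions have no natural correspondence, such as model units and voxels.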

Paper Structure

This paper contains 19 sections, 5 equations, 6 figures, 1 table.

Figures (6)

  • Figure 1: A-B: Differences in brain correlations across positional embedding and dimensionality conditions (PCA vs full dimensions) for causal (A) and bidirectional (B) models. C: Selected language processing regions of interest (ROI) of each subject. D-E: Brain-model correlation trends across model depth for bidirectional (D) and causal (E) models.
  • Figure 2: All models' layer-wise brain scores (left bidirectional models, right causal models). The results of the [+pos] condition are shown in blue; orange shows the results for [-pos] condition. Dashed lines show results using PCA data. Purple line indicates overall mean. 0% indicates input layer; 100% corresponds to output layer. Shading indicates range.
  • Figure 3: Comparison of model-brain alignment across representational levels and architectural configurations. A-B: Average BrainScores for each model under different representational components. The figures show how bidirectional (top) and causal (bottom) architectures differ in their alignment with neural responses. C-D: Empirical cumulative distribution functions (ECDFs) of head-level BrainScores, comparing the distribution of brain alignment across individual attention heads for each model (top bidirectional; bottom causal). Curves positioned further to the right indicate that a greater proportion of heads exhibit higher correlations with brain activity. Across panels, causal models consistently outperform bidirectional models, and the inclusion of positional encoding tends to improve model-brain correspondence, suggesting that positional information enhances the representational similarity between transformer models and neural data.
  • Figure 4: Participant-model CKA (top) and Gromov-Wasserstein distance (bottom) results for [-pos] and [+pos] conditions. For GW distance $\downarrow$ is better; for CKA $\uparrow$ is better.
  • Figure 5: Mean brain scores for non-linguistic control models versus the weakest bidirectional language model, xlm-roberta.
  • ...and 1 more figures