When Language Models Lose Their Mind: The Consequences of Brain Misalignment

Gabriele Merlin, Mariya Toneva

Abstract

While brain-aligned large language models (LLMs) have garnered attention for their potential as cognitive models and for enhancing the safety and trustworthiness of AI, the role of brain alignment in linguistic competence remains uncertain. In this work, we investigate the functional implications of brain alignment by introducing brain-misaligned models: LLMs intentionally trained to predict brain activity poorly while maintaining high language modeling performance. We evaluate these models on over 200 downstream tasks spanning diverse linguistic domains, including semantics, syntax, discourse, reasoning, and morphology. By comparing brain-misaligned models with well-matched brain-aligned counterparts, we isolate the specific impact of brain alignment on language understanding. Our experiments reveal that brain misalignment substantially impairs downstream performance, highlighting the critical role of brain alignment in achieving robust linguistic competence. These findings underscore the importance of brain alignment in LLMs and offer novel insights into the relationship between neural representations and linguistic processing.
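The Figure 1 caption describes how the brain-misaligned models are trained: a language modeling head keeps language performance, while a second brain-prediction head is attached through a gradient reversal layer. The numpy sketch below illustrates only the gradient-reversal mechanics on a toy linear model. It is not the authors' implementation: the linear encoder, the mean-squared-error stand-in for the language modeling loss, and the names `training_step`, `lam`, and `lr` are hypothetical choices made for illustration.

```python
import numpy as np


def grl_forward(h):
    """Gradient reversal layer, forward pass: the identity."""
    return h


def grl_backward(grad_out, lam):
    """Gradient reversal layer, backward pass: negate and scale the gradient."""
    return -lam * grad_out


def mse_and_grad(pred, target):
    """Mean squared error and its gradient with respect to pred."""
    diff = pred - target
    return float(np.mean(diff ** 2)), 2.0 * diff / diff.size


def training_step(x, y_lm, y_brain, params, lam=1.0, lr=0.1):
    """One SGD step on a toy linear model with an LM head and a brain head.

    Both heads descend on their own losses, but the shared encoder receives
    the LM gradient plus the *reversed* brain-prediction gradient, pushing its
    representation to become less predictive of brain activity.
    """
    w_enc, w_lm, w_brain = params["enc"], params["lm"], params["brain"]
    h = x @ w_enc  # shared representation

    # Hypothetical stand-in for the language modeling loss (a regression, not an LM).
    lm_loss, g_lm_pred = mse_and_grad(h @ w_lm, y_lm)

    # The brain head sits behind the gradient reversal layer.
    h_rev = grl_forward(h)
    brain_loss, g_brain_pred = mse_and_grad(h_rev @ w_brain, y_brain)

    # Head gradients: ordinary descent on each head's own loss.
    g_w_lm = h.T @ g_lm_pred
    g_w_brain = h_rev.T @ g_brain_pred

    # Gradient reaching the shared representation: the brain path is reversed.
    g_h = g_lm_pred @ w_lm.T + grl_backward(g_brain_pred @ w_brain.T, lam)
    g_w_enc = x.T @ g_h

    new_params = {
        "enc": w_enc - lr * g_w_enc,
        "lm": w_lm - lr * g_w_lm,
        "brain": w_brain - lr * g_w_brain,
    }
    return new_params, lm_loss, brain_loss, g_w_enc
```

With `lam=0` the encoder sees only the language modeling gradient; increasing `lam` scales how strongly the encoder is pushed away from representations the brain head can exploit, while the brain head itself keeps trying to predict brain activity. This adversarial pairing is what makes the representation uninformative about the brain data rather than merely ignoring it.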

Paper Structure (37 sections, 2 equations, 65 figures, 3 tables)

Introduction
Related Works
Methodology
Pretrained Models
FMRI Data
Controlling Brain Alignment
Brain Misaligned Model
Brain Preserving Model
Brain Tuned Model
Model Selection and Training
Conditions for a successful comparison between models.
Evaluation
Language modeling.
Brain alignment.
Linguistic competence.
...and 22 more sections

Figures (65)

Figure 1: A schematic of the proposed approach. Our method is based on fine-tuning a pretrained language model with two simultaneous objectives: maintaining its language modeling ability while reducing its alignment with brain recordings. Language modeling performance is preserved by continuing training on a fine-tuning dataset using the standard language modeling objective. Brain alignment is reduced by introducing a second prediction head and a gradient reversal layer, which encourages the model to produce representations that are uninformative about the corresponding brain activity.
Figure 2: Brain alignment of the BERT-based Brain Preserving (A) and Brain Misaligned (B) models for one participant on the Harry Potter dataset (see the Appendix for all participants), and the difference between the two (C). The Brain Misaligned model exhibits substantially weaker alignment, particularly in language regions (C, D).
Figure 3: Average win rate and standard error, across model and dataset combinations, of the Brain Misaligned and Brain Preserving models across tasks (Left) and across linguistic subfields (Right). The average win rate indicates how often each model outperforms its counterpart across model and dataset combinations. The Brain Preserving model significantly outperforms the Brain Misaligned model ($p<0.05$, Wilcoxon signed-rank test) (Left), suggesting that removing brain alignment impairs linguistic competence. The Brain Preserving model shows a higher win rate in all linguistic subfields, particularly semantics and syntax (Right), although these differences are not statistically significant (Wilcoxon signed-rank test with Holm-Bonferroni correction) owing to variability across model-dataset combinations.
Figure 4: Average win rate with standard error, across model and dataset combinations, of the Brain Misaligned and Brain Preserving models for various linguistic phenomena. Each bar represents the average win rate for a specific linguistic phenomenon, with error bars indicating standard error. Brain Preserving models tend to outperform Brain Misaligned models on the majority of tasks. Concrete examples of the linguistic tasks are provided in the table of example phenomena.
Figure 5: Average win rate and standard error, across model and dataset combinations, of the Brain Preserving and Brain Tuned models across tasks (Left) and across linguistic subfields (Right). The Brain Tuned model significantly outperforms the Brain Preserving model ($p<0.05$, Wilcoxon signed-rank test) (Left), suggesting that improving brain alignment leads to gains in linguistic competence. The Brain Tuned model shows a higher win rate in the discourse, morphology, reasoning, semantics, and syntax subfields (Right), and the advantage is significant for semantics and syntax ($p<0.05$, Wilcoxon signed-rank test with Holm-Bonferroni correction), suggesting that improving brain alignment particularly benefits semantic and syntactic tasks.
...and 60 more figures
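Figures 3 to 5 report average win rates with standard errors, and Figures 3 and 5 add Wilcoxon signed-rank tests with Holm-Bonferroni correction. The sketch below shows one way these summary statistics could be computed; it is not the authors' evaluation code. It assumes paired task scores arranged as (model-dataset combinations × tasks), counts a tie as half a win, and takes p-values as given (they could come from a paired test such as `scipy.stats.wilcoxon`, omitted here to keep the sketch dependency-light). The helper names `win_rates`, `mean_and_sem`, and `holm_bonferroni` are hypothetical.

```python
import numpy as np


def win_rates(scores_a, scores_b):
    """Per-combination win rate of model A over model B.

    scores_a, scores_b: arrays of shape (n_combinations, n_tasks) holding
    paired task scores (higher is better). A tie counts as half a win
    (an assumption of this sketch).
    """
    a = np.asarray(scores_a, dtype=float)
    b = np.asarray(scores_b, dtype=float)
    wins = (a > b).astype(float) + 0.5 * (a == b)
    return wins.mean(axis=1)  # average over tasks: one value per combination


def mean_and_sem(values):
    """Mean and standard error of the mean across combinations."""
    v = np.asarray(values, dtype=float)
    return float(v.mean()), float(v.std(ddof=1) / np.sqrt(v.size))


def holm_bonferroni(p_values, alpha=0.05):
    """Holm-Bonferroni step-down correction.

    Returns adjusted p-values (in the original order) and reject flags.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The p-value at 0-based rank r is multiplied by (m - r);
        # the running maximum keeps adjusted p-values monotone.
        running_max = max(running_max, min(1.0, (m - rank) * p_values[i]))
        adjusted[i] = running_max
    reject = [p <= alpha for p in adjusted]
    return adjusted, reject
```

For a per-subfield comparison, the p-values for semantics, syntax, discourse, reasoning, and morphology would be passed to `holm_bonferroni` together, so that the correction accounts for testing all five subfields at once. With ties counted as half a win, the two models' win rates on each combination sum to one.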
