Evidence from fMRI Supports a Two-Phase Abstraction Process in Language Models

Emily Cheng, Richard J. Antonello

TL;DR

The paper investigates why intermediate LLM layers best predict human brain activity and tests the hypothesis that a two-phase abstraction process underlies this brain–LM alignment. The authors measure brain-model representational similarity, nonlinear intrinsic dimensionality $I_d$, and linear dimensionality $d$, and they estimate layerwise surprisal with TunedLens. They show that $I_d$ tracks encoding performance across layers and that the layers pass through a phase transition from an abstraction (composition) phase to a prediction (extraction) phase. High $I_d$ and strong encoding performance emerge concurrently during training across model families, which the authors take as evidence that the brain–LM correspondence is driven by abstraction properties rather than pure next-token prediction. The work has implications for improving encoding models by leveraging multi-layer spectral information, and it informs theories of language processing in both brains and LLMs.

Abstract

Research has repeatedly demonstrated that intermediate hidden states extracted from large language models are able to predict measured brain response to natural language stimuli. Yet, very little is known about the representation properties that enable this high prediction performance. Why is it the intermediate layers, and not the output layers, that are most capable for this unique and highly general transfer task? In this work, we show that evidence from language encoding models in fMRI supports the existence of a two-phase abstraction process within LLMs. We use manifold learning methods to show that this abstraction process naturally arises over the course of training a language model and that the first "composition" phase of this abstraction process is compressed into fewer layers as training continues. Finally, we demonstrate a strong correspondence between layerwise encoding performance and the intrinsic dimensionality of representations from LLMs. We give initial evidence that this correspondence primarily derives from the inherent compositionality of LLMs and not their next-word prediction properties.
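
To make the two dimensionality measures concrete, here is a minimal sketch of how a nonlinear intrinsic dimension and a linear dimension might be estimated for one layer's representations. It is an illustration under stated assumptions, not the authors' pipeline: the paper's figures reference the GRIDE estimator, whereas this sketch swaps in the simpler maximum-likelihood form of the TwoNN estimator, and the 99% explained-variance threshold used for the linear dimension is an assumed choice. The function names `twonn_intrinsic_dim` and `pca_linear_dim` are hypothetical.

```python
import numpy as np


def twonn_intrinsic_dim(X):
    """Maximum-likelihood TwoNN estimate of intrinsic dimension.

    NOTE: stand-in for the GRIDE estimator named in the paper's figures.
    X has shape (n_points, n_features).
    """
    X = np.asarray(X, dtype=np.float64)
    # Pairwise squared Euclidean distances via the Gram matrix.
    sq_norms = np.sum(X ** 2, axis=1)
    d2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    np.maximum(d2, 0.0, out=d2)      # clip small negative round-off
    np.fill_diagonal(d2, np.inf)     # exclude each point from its own neighbours
    # After partitioning, columns 0 and 1 hold the two smallest distances per row.
    two_smallest = np.partition(d2, 1, axis=1)[:, :2]
    r1 = np.sqrt(two_smallest[:, 0])
    r2 = np.sqrt(two_smallest[:, 1])
    keep = r1 > 0                    # drop exact duplicates
    mu = r2[keep] / r1[keep]
    # The ratios mu follow a Pareto law whose shape parameter is the dimension.
    return keep.sum() / np.sum(np.log(mu))


def pca_linear_dim(X, variance=0.99):
    """Number of principal components needed to reach `variance` (assumed threshold)."""
    X = np.asarray(X, dtype=np.float64)
    Xc = X - X.mean(axis=0)
    s = np.linalg.svd(Xc, compute_uv=False)
    ratio = np.cumsum(s ** 2) / np.sum(s ** 2)
    return int(np.searchsorted(ratio, variance) + 1)


if __name__ == "__main__":
    # Synthetic example: a helix is a one-dimensional curve that fills 3-D space.
    rng = np.random.default_rng(0)
    t = rng.uniform(0.0, 4.0 * np.pi, size=1000)
    helix = np.stack([np.cos(t), np.sin(t), t], axis=1)
    print("TwoNN intrinsic dimension:", twonn_intrinsic_dim(helix))
    print("PCA linear dimension:", pca_linear_dim(helix))
```

On curved data such as the synthetic helix, the nonlinear estimate can sit well below the linear one, which is the kind of gap that makes it worth reporting $I_d$ and $d$ separately. For real hidden states, `X` would be the matrix of token representations extracted from a single layer.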

Paper Structure (19 sections, 1 equation, 10 figures, 2 tables)

Figures (10)

  • Figure 1: Analyzing Layerwise Representational Trends: (a) $I_d$ is well correlated with encoding performance across model sizes. $I_d$ is normalized here by the log of embedding size to account for power law scaling. (b) The abstract-predict phase transition at layer 17 is shown for OPT-1.3B. At the peak of encoding performance (red dashed line), the next-token prediction loss (blue curve) sharply decreases, corresponding with a decrease in encoding performance. (c) A flatmap of the brain, for one subject, is shown colored voxelwise by the correlation over layers between $I_d$ and encoding performance. With the exception of auditory cortex (bright), which captures low-level spectral information, encoding performance in brain regions thought to perform higher-level linguistic processing (dark) is well-captured by representational $I_d$. (d) The layer-wise representational similarity computed with linear CKA is shown for OPT-1.3B.
  • Figure 2: Encoding Performance and Intrinsic Dimensionality Peaks Manifest Concurrently over Training: (a) The evolution of layerwise encoding performance over training of the Pythia 6.9B model is shown. A peak is reached at layer 13 of the model. (b) Likewise, a peak in $I_d$ at layer 13 manifests over training. Red dots in each figure denote maximal layers for the respective metric.
  • Figure B.1: GRIDE scale analysis for Pythia-6.9B. The estimated intrinsic dimension (y axis) varies according to the chosen scale $k$ (x axis). It is recommended to choose a scale where the local change is minimal, in this case, $k=2^4$.
  • Figure E.1: Remaining tuned lens results for OPT-125M, OPT-13B, and Pythia-6.9B.
  • Figure E.2: Voxelwise ID correlation results as in Figure 1c for OPT-125M.
  • ...and 5 more figures
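
Figure 1(d) reports layer-wise representational similarity computed with linear CKA. As a rough sketch of what that metric computes, the snippet below implements the standard feature-space formula for linear centered kernel alignment between two representation matrices over the same inputs. The function name `linear_cka` and the synthetic inputs are assumptions for illustration; the authors' actual code and preprocessing are not given in this summary.

```python
import numpy as np


def linear_cka(X, Y):
    """Linear CKA between two representation matrices.

    X has shape (n_samples, p) and Y has shape (n_samples, q); the rows of
    X and Y must correspond to the same inputs (e.g. the same tokens).
    """
    X = np.asarray(X, dtype=np.float64)
    Y = np.asarray(Y, dtype=np.float64)
    # Centre every feature so the comparison ignores constant offsets.
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    cross = np.linalg.norm(Y.T @ X, "fro") ** 2
    self_x = np.linalg.norm(X.T @ X, "fro")
    self_y = np.linalg.norm(Y.T @ Y, "fro")
    return cross / (self_x * self_y)


if __name__ == "__main__":
    # Synthetic stand-ins for the hidden states of two layers.
    rng = np.random.default_rng(0)
    layer_a = rng.normal(size=(200, 8))
    rotation, _ = np.linalg.qr(rng.normal(size=(8, 8)))
    layer_b = 3.0 * layer_a @ rotation   # rotated and rescaled copy of layer_a
    layer_c = rng.normal(size=(200, 8))  # unrelated representation
    print("CKA(a, rotated a):", linear_cka(layer_a, layer_b))
    print("CKA(a, unrelated):", linear_cka(layer_a, layer_c))
```

Linear CKA is invariant to orthogonal transformations and isotropic rescaling of either representation, so it compares the geometry of two layers rather than their particular coordinate axes. Evaluating it for every pair of layers would give a similarity matrix of the kind the caption describes.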