Emergence of a High-Dimensional Abstraction Phase in Language Transformers

Emily Cheng; Diego Doimo; Corentin Kervadec; Iuri Macocco; Jade Yu; Alessandro Laio; Marco Baroni

TL;DR

This work investigates how transformer language models organize linguistic information by tracking the intrinsic dimension (ID) of layer representations, estimated with GRIDE, across the layers of five LMs on three corpora. It introduces the Information Imbalance Delta to study the neighborhood structure of representations and cross-model similarities. A central high-ID phase appears in intermediate layers, marking a transition to abstract syntactic and semantic processing; representations in this phase are the first to transfer viably to downstream tasks. The findings reveal cross-model geometric convergence at the ID peak and have practical implications for layer-wise pruning, fine-tuning, and model interfacing, suggesting that core linguistic processing concentrates in a distinct mid-layer phase.
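
To make the notion of intrinsic dimension concrete, the sketch below estimates the ID of a point cloud. It does not reproduce GRIDE; it swaps in the simpler two-nearest-neighbor (TwoNN) maximum-likelihood estimator, which uses only the ratio of each point's second to first nearest-neighbor distance. The function name `twonn_id`, the toy data, and the random linear embedding are assumptions made for illustration, not part of the paper.

```python
import numpy as np

def twonn_id(X):
    """TwoNN maximum-likelihood estimate of intrinsic dimension.

    X: (n_points, n_features) array with no duplicate rows.
    Not GRIDE: a simpler estimator used here only as an illustration.
    """
    X = np.asarray(X, dtype=float)
    # Pairwise Euclidean distances (O(n^2) memory; fine for a small sketch).
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbor
    # Two smallest distances per row: r1 (nearest) and r2 (second nearest).
    two_smallest = np.partition(dists, 1, axis=1)[:, :2]
    r1, r2 = two_smallest[:, 0], two_smallest[:, 1]
    mu = r2 / r1
    # Under the TwoNN model, mu follows a Pareto law with exponent d,
    # so the maximum-likelihood estimate of d is n / sum(log mu).
    return len(mu) / np.sum(np.log(mu))

# Toy data (hypothetical): a 2-D Gaussian linearly embedded in 10 dimensions.
rng = np.random.default_rng(0)
latent = rng.normal(size=(600, 2))
embedding = rng.normal(size=(2, 10))
X = latent @ embedding
print(twonn_id(X))  # expected to be close to 2, far below the ambient 10
```

In the paper's setting, the rows of `X` would be one layer's hidden representations of the input sequences, and the estimate would be repeated per layer to obtain a layerwise ID profile.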

Abstract

A language model (LM) is a mapping from a linguistic context to an output token. However, much remains to be known about this mapping, including how its geometric properties relate to its function. We take a high-level geometric approach to its analysis, observing, across five pre-trained transformer-based LMs and three input datasets, a distinct phase characterized by high intrinsic dimensionality. During this phase, representations (1) correspond to the first full linguistic abstraction of the input; (2) are the first to viably transfer to downstream tasks; (3) predict each other across different LMs. Moreover, we find that an earlier onset of the phase strongly predicts better language modelling performance. In short, our results suggest that a central high-dimensionality phase underlies core linguistic processing in many common LM architectures.
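
Claim (3), that representations predict each other across LMs, is quantified in the paper with the Information Imbalance $\Delta$. The sketch below assumes the standard rank-based definition, $\Delta(A \to B) = \frac{2}{N}\,\langle r^B \mid r^A = 1 \rangle$: the mean rank in space $B$ of each point's nearest neighbor in space $A$, scaled so that values near 0 mean $A$ contains the neighborhood information of $B$ and values near 1 mean it contains none. The paper's exact variant is not stated in this excerpt, and the function names and toy data are hypothetical.

```python
import numpy as np

def neighbor_ranks(X):
    """ranks[i, j] = rank of point j among the neighbors of point i (1 = nearest)."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)      # push each point to the end of its own list
    order = np.argsort(dists, axis=1)    # order[i, k] = index of k-th nearest point
    ranks = np.empty_like(order)
    rows = np.arange(n)[:, None]
    ranks[rows, order] = np.arange(1, n + 1)[None, :]
    return ranks

def information_imbalance(A, B):
    """Delta(A -> B): how well neighborhoods in A predict neighborhoods in B."""
    ranks_a = neighbor_ranks(A)
    ranks_b = neighbor_ranks(B)
    n = len(ranks_a)
    nearest_in_a = np.argmin(ranks_a, axis=1)  # index of each point's rank-1 neighbor in A
    return 2.0 / n * np.mean(ranks_b[np.arange(n), nearest_in_a])

# Toy check (hypothetical data): B is a projection of A, C is unrelated to A.
rng = np.random.default_rng(0)
A = rng.uniform(size=(300, 3))
B = A[:, :1]
C = rng.uniform(size=(300, 3))
print(information_imbalance(A, B), information_imbalance(B, A))
print(information_imbalance(A, C))
```

On the toy data, $\Delta(A \to B)$ should come out smaller than $\Delta(B \to A)$, since the full space determines its own projection but not the reverse, and $\Delta(A \to C)$ should be close to 1. The asymmetry is what lets the measure express containment rather than only similarity.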

Paper Structure (43 sections, 1 equation, 25 figures, 1 table)

Introduction
Related work
Methods
Models
Data
Probing and downstream tasks
Intrinsic Dimension
Quantifying the relative information content of different representations
Results
Emergence of a central high-dimensionality phase
ID is a geometric signature of learned structure
The ID peak marks a transition in layer function
At the ID peak, different models share representation spaces
Language processing during the high-dimensionality phase
The ID-peak representations contain less surface-form information
...and 28 more sections

Figures (25)

Figure 1: Average ID over layers over 5 random partitions from each of three corpora: Bookcorpus, the Pile and Wikitext. (Left): different models' layerwise ID plotted for the original corpora. (Center): different models' layerwise ID plotted for the shuffled corpora. (Right): different Pythia training checkpoints' layerwise ID on the original corpora, where darker curves are later checkpoints. In the middle layers, shuffled corpus ID (center) is lower than non-shuffled ID (left), suggesting that linguistic processing contributes to ID expansion. ID increases over the course of training for all layers to reach the final profile at step 143000 (right), suggesting that ID reflects learned linguistic features. All curves are shown with $\pm$ 2 standard deviations (shuffled SDs are very small).
Figure 2: For Llama, OPT, Pythia (left to right), the ID is overlaid with $\Delta(l_i \to l_{first})$ (gray) and $\Delta(l_i \to l_{last})$ (brown). Plots are shown with $\pm$ 2 standard deviations over 5 partitions of 3 corpora. For all models, there is a peak in $\Delta(l_i \to l_{first})$ (gray) around the ID peak.
Figure 3: Forward $\Delta$ scope (left: Llama; center: OPT; right: Pythia): continuous lines report, for each layer $l_n$, the number of adjacent following layers $l_{n+k}$ for which $\Delta(l_n \to l_{n+k})\leq 0.1$. The dashed line represents the longest possible scope for each layer. Values are averaged across corpora and partitions, with error bars of $\pm$ 2 standard deviations.
Figure 4: Cross-model $\Delta$. ID-peak sections are shaded in orange. Different symbols mark different $\Delta$ levels in the two directions (lower values correspond to a stronger trend towards information containment). High $\Delta$ scores ($>0.1$), corresponding to low information containment, are not shown. Values averaged over corpora and partitions.
Figure 5: Linguistic knowledge probing performance ($\pm$ 2 SDs across 5 random seeds) overlaid on the ID profile for Llama, OPT, and Pythia (left to right). Row (a) shows the surface-form tasks Sentence Length and Word Content, where probe performance decreases through the ID peak. Row (b) shows the syntactic and semantic tasks Bigram Shift, Coordination Inversion and Odd Man Out, where probe performance reaches or approaches its maximum within the ID peak for every task. This suggests that the ID peak marks abstract, rather than surface, representations of the input.
...and 20 more figures
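
The forward $\Delta$ scope described in the Figure 3 caption can be sketched as follows. The sketch assumes a precomputed matrix `delta` with `delta[n, m]` $= \Delta(l_n \to l_m)$ and reads "adjacent following layers" as a contiguous run starting at $l_{n+1}$; both the matrix layout and that reading are assumptions, and the toy values are made up for illustration.

```python
import numpy as np

def forward_scope(delta, threshold=0.1):
    """For each layer n, count consecutive following layers m = n+1, n+2, ...
    with delta[n, m] <= threshold, stopping at the first layer that exceeds it."""
    delta = np.asarray(delta, dtype=float)
    n_layers = delta.shape[0]
    scope = np.zeros(n_layers, dtype=int)
    for n in range(n_layers):
        for m in range(n + 1, n_layers):
            if delta[n, m] > threshold:
                break
            scope[n] += 1
    return scope

def max_scope(n_layers):
    """Longest possible scope per layer (the dashed line in the caption)."""
    return np.arange(n_layers - 1, -1, -1)

# Toy 4-layer example with made-up Delta values.
delta = np.ones((4, 4))
delta[0, 1], delta[0, 2], delta[0, 3] = 0.05, 0.08, 0.50
delta[1, 2], delta[1, 3] = 0.20, 0.05
delta[2, 3] = 0.10
print(forward_scope(delta))  # [2 0 1 0]
print(max_scope(4))          # [3 2 1 0]
```

Layer 1 gets a scope of 0 even though `delta[1, 3]` is below the threshold, because the run is broken at layer 2; under a non-contiguous reading of the caption, the inner `break` would become a `continue`.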
