Transformers need glasses! Information over-squashing in language tasks

Federico Barbero, Andrea Banino, Steven Kapturowski, Dharshan Kumaran, João G. M. Araújo, Alex Vitvitskyi, Razvan Pascanu, Petar Veličković

TL;DR

The paper proves that certain distinct input sequences to a decoder-only Transformer can yield arbitrarily close representations in the final token, and relates the resulting loss of sensitivity to input tokens to the well-known phenomenon of over-squashing in graph neural networks.

Abstract

We study how information propagates in decoder-only Transformers, which are the architectural backbone of most existing frontier large language models (LLMs). We rely on a theoretical signal propagation analysis -- specifically, we analyse the representations of the last token in the final layer of the Transformer, as this is the representation used for next-token prediction. Our analysis reveals a representational collapse phenomenon: we prove that certain distinct sequences of inputs to the Transformer can yield arbitrarily close representations in the final token. This effect is exacerbated by the low-precision floating-point formats frequently used in modern LLMs. As a result, the model is provably unable to respond to these sequences in different ways -- leading to errors in, e.g., tasks involving counting or copying. Further, we show that decoder-only Transformer language models can lose sensitivity to specific tokens in the input, which relates to the well-known phenomenon of over-squashing in graph neural networks. We provide empirical evidence supporting our claims on contemporary LLMs. Our theory also points to simple solutions towards ameliorating these issues.
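
The interaction between collapse and low precision can be illustrated with a toy calculation. The sketch below is not the paper's construction: it assumes a hypothetical one-dimensional model in which the final-token representation is the plain average of per-token values (1.0 for a `1` token, 0.0 for a `0` token), and it emulates bfloat16 by rounding float32 bit patterns. The names `to_bfloat16`, `toy_representation`, and `collapsed` are illustrative only.

```python
import struct


def to_bfloat16(x):
    """Round a finite float to the nearest bfloat16 value (ties to even)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    # Add half of the discarded range, plus the tie-breaking bit.
    bits += 0x7FFF + ((bits >> 16) & 1)
    bits &= 0xFFFF0000  # keep sign, 8 exponent bits, 7 mantissa bits
    return struct.unpack(">f", struct.pack(">I", bits))[0]


def toy_representation(n_ones):
    """Assumed toy model: uniform average over n_ones '1' tokens and one final '0'."""
    return n_ones / (n_ones + 1)


def collapsed(n_ones):
    """True if lengths n_ones and n_ones + 1 coincide after bfloat16 rounding."""
    a = to_bfloat16(toy_representation(n_ones))
    b = to_bfloat16(toy_representation(n_ones + 1))
    return a == b


for n in (1, 10, 100, 1000):
    exact_gap = toy_representation(n + 1) - toy_representation(n)
    print(n, exact_gap, collapsed(n))
```

In exact arithmetic the two averages always differ, by $1/((n+1)(n+2))$; under this toy model the rounded values coincide once that gap falls well below the bfloat16 spacing near 1.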

Paper Structure (33 sections, 16 theorems, 20 equations, 12 figures)

Key Result

Lemma 4.1

Consider two sequences $\mathbf{x}, \mathbf{x}^* \in \mathbb{R}^n$ such that $\lim_{n \to \infty} \left \lvert \mathbf{x}_n - \mathbf{x}^*_n \right \rvert = 0$. Then, the $L_1$ difference of their softmax tends to $0$.
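
A quick numerical check can make the statement concrete. The sketch below is one reading of the informal lemma, not the paper's proof setup: it assumes bounded logits (a hypothetical repeating pattern in $[0, 1]$) and a second vector that differs from the first by $\pm 1/n$ in every entry, so the largest entrywise difference shrinks as $n$ grows. The function names are illustrative.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def l1_softmax_gap(n):
    """L1 distance between softmax(x) and softmax(x*) for length-n vectors."""
    # Assumed bounded logits: a fixed repeating pattern 0, 0.5, 1, 0, 0.5, 1, ...
    x = [(i % 3) / 2.0 for i in range(n)]
    # Perturb each entry by +/- 1/n, so max |x_i - x*_i| = 1/n -> 0.
    x_star = [xi + ((-1) ** i) / n for i, xi in enumerate(x)]
    p, q = softmax(x), softmax(x_star)
    return sum(abs(a - b) for a, b in zip(p, q))


gaps = {n: l1_softmax_gap(n) for n in (10, 100, 1000, 10000)}
for n, gap in gaps.items():
    print(n, gap)
```

Under these assumptions the printed gap shrinks roughly in proportion to $1/n$, which is consistent with the $L_1$ difference tending to $0$.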

Figures (12)

  • Figure 1: (a) Representational Collapse (Theorem 4.2). From top to bottom, we have a series of sequences given to Transformer architectures, each comprising repeated 1 tokens with a single 0 token at the end. The color and proximity of the curved lines illustrate how these representations converge as sequence length increases. (b) Over-squashing (Theorem 5.1). Due to the architecture of decoder-only Transformers, tokens that are earlier in their input sequence will have significantly more paths through which their data can reach the representation used for next-token prediction, leading to 'over-squashing'. This effect is depicted here for an early token (blue) and later token (red) in a five-token sequence.
  • Figure 2: Results on simple copying tasks. (a) Gemini was prompted to predict the last token (diamond) of a sequence ‘1...10’ or the first token (square) of a sequence ‘01...1’. (b) Same as (a) but with hints (see the counting experiments section for details). (c) Same as (a) but the sequences have interleaved 0s and 1s. See the experimental details appendix for extra details.
  • Figure 3: Gemini 1.5 being prompted to sum $1+\dots+1$ (Column 1), to count the number of ones in a sequence of 1s (Column 2), to count the number of ones in a sequence of ones and zeroes (a Bernoulli sequence in which a one is sampled with probability 0.7) (Column 3), and to count the number of times a word appears in a sentence (Column 4).
  • Figure 4: Frequency of different outputted values for Gemini 1.5 for the counting tasks. The large density at $100$ suggests that Gemini is likely not counting, but instead possibly performing some crude form of subitising.
  • Figure 5: Representational collapse for counting (a, b) and copying (c, d) tasks.
  • ...and 7 more figures
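
The path asymmetry described for Figure 1(b) can be counted directly in a simplified setting. The sketch below assumes an unweighted causal attention graph in which, at every layer, position $j$ can pass information to any position $k \ge j$; it ignores attention weights and other architectural details, so it illustrates the combinatorics only and is not the bound of Theorem 5.1. The function name `count_paths` is hypothetical.

```python
def count_paths(n_tokens, n_layers):
    """Count paths from each input position to the last token's final-layer state.

    Assumes one causal hop per layer: position j may send to any k >= j.
    Returns a list whose entry j is the number of paths starting at position j.
    """
    # At the final layer only the last position is the target.
    ways = [0] * (n_tokens - 1) + [1]
    for _ in range(n_layers):
        # Stepping back one layer: j reaches the target via any k >= j.
        new_ways = [0] * n_tokens
        running = 0
        for j in range(n_tokens - 1, -1, -1):
            running += ways[j]
            new_ways[j] = running
        ways = new_ways
    return ways


for layers in (1, 2, 3):
    print(layers, count_paths(5, layers))
```

For a five-token sequence, the first token ends up with the most paths to the final representation once there are at least two layers, and the last token has exactly one at any depth, matching the qualitative picture in the figure.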

Theorems & Definitions (26)

  • Lemma 4.1: Informal
  • Theorem 4.2: Representational Collapse -- informal
  • Theorem 5.1: Over-squashing in Transformers
  • Proposition 5.2: Informal
  • Proposition 6.1
  • Corollary 6.2: Informal
  • Lemma B.1
  • proof
  • Lemma B.2
  • proof
  • ...and 16 more