Next-token prediction capacity: general upper bounds and a lower bound for transformers

Liam Madden; Curtis Fox; Christos Thrampoulidis

TL;DR

The paper asks how many distinct contexts a decoder-only transformer can interpolate in next-token prediction. It formalizes next-token prediction capacity and proves upper bounds in both the general and the empirical setting, alongside a matching lower bound for a one-layer, multi-head transformer. The analysis hinges on an injectivity property of self-attention and a rank-based argument for the FNN, with token-averaging offered as a simple, equivalent mechanism. It shows that the capacity scales as $\Omega\bigl(\frac{k}{\zeta-1}\bigr)$ and, under real-analytic, non-polynomial activations, is achievable with $\Theta\bigl(\frac{k}{\zeta-1}\bigr)$ parameters; numerical experiments suggest that $\Theta(n\zeta)$ parameters suffice to train to the entropy lower bound. The work gives a rigorous theoretical account of memorization in next-token prediction, clarifying fundamental limits and informing architectural choices in transformer design and optimization.

Abstract

Given a sequence of tokens, such as words, the task of next-token prediction is to predict the next-token conditional probability distribution. Decoder-only transformers have become effective models for this task, but their properties are still not fully understood. In particular, the largest number of distinct context sequences that a decoder-only transformer can interpolate next-token distributions for has not been established. To fill this gap, we prove upper and lower bounds on this number, which are equal up to a multiplicative constant. We prove these bounds in the general setting where next-token distributions can be arbitrary as well as the empirical setting where they are calculated from a finite number of document sequences. Our lower bounds are for one-layer multi-head decoder-only transformers and our proofs highlight an important injectivity property satisfied by self-attention. Furthermore, we provide numerical evidence that the minimal number of parameters for memorization is sufficient for being able to train the model to the entropy lower bound.
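
To make the empirical setting concrete, the sketch below builds next-token distributions from a handful of toy documents and computes the entropy lower bound on the average cross-entropy training loss. It is an illustration under stated assumptions, not the authors' code: the function names, the toy documents, and the choice to treat every nonempty prefix of a document as a context are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def next_token_counts(documents):
    """Count, for every context (nonempty prefix), how often each token follows it.

    Assumption: a context is the full prefix doc[:t] and the target is doc[t].
    """
    counts = defaultdict(Counter)
    for doc in documents:
        for t in range(1, len(doc)):
            counts[tuple(doc[:t])][doc[t]] += 1
    return counts


def empirical_distributions(counts):
    """Normalize the counts into one next-token distribution per unique context."""
    return {
        ctx: {tok: c / sum(ctr.values()) for tok, c in ctr.items()}
        for ctx, ctr in counts.items()
    }


def entropy_lower_bound(counts):
    """Smallest achievable average cross-entropy (in nats) over all (context, next-token) pairs.

    This is the conditional entropy of the empirical next-token distributions,
    weighted by how often each context occurs.
    """
    total = sum(sum(ctr.values()) for ctr in counts.values())
    bound = 0.0
    for ctr in counts.values():
        n_ctx = sum(ctr.values())
        for c in ctr.values():
            bound -= (c / total) * math.log(c / n_ctx)
    return bound


# Hypothetical toy corpus: three documents over the vocabulary {a, b, c, d}.
documents = [["a", "b", "c"], ["a", "b", "d"], ["a", "c", "c"]]
counts = next_token_counts(documents)
distributions = empirical_distributions(counts)
bound = entropy_lower_bound(counts)
print(len(distributions), "unique contexts; entropy lower bound =", bound)
```

A model interpolates these distributions exactly when its training loss equals this bound, so the gap between the two is a natural measure of how far the model is from memorizing the data.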

Paper Structure (19 sections, 12 theorems, 35 equations, 2 figures, 1 table)

Introduction
Results
Related Work
Probability of language
Memory capacity
Optimization of transformers
Miscellaneous results
Organization
Preliminaries
Probabilistic Language Space
Next-token Prediction Capacity
Transformer Model
Transformer Next-token Prediction Capacity
Injectivity of Self-attention
Rank Results
...and 4 more sections

Key Result

Lemma 1

Let $k\ge 1$. Let $M$ and $N$ be $C^k$ manifolds of dimension $m$ and $n$ respectively. Let $F:M\to N$ be $C^k$. If $m\le n+k-1$ (in particular, whenever $m\le n$), then the set of critical values of $F$ has measure zero in $N$.
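
As a quick sanity check on the statement, here are two standard one-dimensional instances of Sard's theorem. They are textbook illustrations, not examples taken from the paper; recall that a critical value is the image of a point at which the differential of $F$ fails to be surjective.

```latex
% Example A: F(x) = x^2 on M = N = \mathbb{R}, so m = n = 1 and F is smooth.
\[
  F(x) = x^2, \qquad DF(x) = 2x, \qquad
  \{\text{critical points}\} = \{0\}, \qquad
  \{\text{critical values}\} = \{F(0)\} = \{0\}.
\]
% Example B: a constant map, where every point is critical.
\[
  F(x) = c, \qquad DF(x) = 0 \ \text{for all } x, \qquad
  \{\text{critical points}\} = \mathbb{R}, \qquad
  \{\text{critical values}\} = \{c\}.
\]
% In both cases the set of critical values is a single point,
% which has Lebesgue measure zero in N = \mathbb{R}.
```

Example B shows that the theorem constrains critical values, not critical points: the set of critical points can be all of $M$.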

Figures (2)

Figure 1: We show that our model requires more parameters to memorize an increasing number of unique contexts. Left: As the hidden dimension, $m$, increases and the number of unique contexts, $n$, decreases, the gap between the training error and the entropy lower bound trends downwards. Right: As the number of unique contexts increases, the minimum number of parameters required for the gap between the training error and the entropy lower bound to fall below the minimum threshold increases.
Figure 2: Even when only training the FNN layers, we show that our model can memorize an increasing number of unique contexts as the hidden dimension $m$ is increased. Left: As the hidden dimension, $m$, increases and the number of unique contexts, $n$, decreases, the gap between the training error and the entropy lower bound trends downwards. Right: As the number of unique contexts increases, the minimum number of parameters required for the gap between the training error and the entropy lower bound to fall below the minimum threshold increases. Both FNN linear layers are trained in this experiment.

Theorems & Definitions (33)

Lemma 1: Sard's theorem
Definition 2
Definition 3
Definition 4
Proposition 5
proof
Proposition 6
proof
Definition 7
Example 8
...and 23 more
