Next-token prediction capacity: general upper bounds and a lower bound for transformers
Liam Madden, Curtis Fox, Christos Thrampoulidis
TL;DR
The paper asks how many distinct contexts a decoder-only transformer can interpolate in next-token prediction. It formalizes next-token prediction capacity and proves upper bounds that hold in both the general and the empirical setting, together with a lower bound for a one-layer, multi-head transformer that matches up to a multiplicative constant. The analysis hinges on an injectivity property of self-attention and a rank-based argument for the feedforward neural network (FNN), with token-averaging offered as a simple, equivalent mechanism. It shows that the capacity scales as $\Omega\bigl(\frac{k}{\zeta-1}\bigr)$ and, for real-analytic, non-polynomial activations, is achievable with $\Theta\bigl(\frac{k}{\zeta-1}\bigr)$ parameters. Numerical experiments suggest that a model with $\Theta(n\zeta)$ parameters can be trained to the entropy lower bound. The work provides a rigorous theoretical lens on memorization in next-token prediction, clarifying fundamental limits and informing architectural choices in transformer design and optimization.
Abstract
Given a sequence of tokens, such as words, the task of next-token prediction is to predict the next-token conditional probability distribution. Decoder-only transformers have become effective models for this task, but their properties are still not fully understood. In particular, the largest number of distinct context sequences for which a decoder-only transformer can interpolate next-token distributions has not been established. To fill this gap, we prove upper and lower bounds on this number, which are equal up to a multiplicative constant. We prove these bounds in the general setting, where next-token distributions can be arbitrary, as well as in the empirical setting, where they are calculated from a finite number of document sequences. Our lower bounds are for one-layer multi-head decoder-only transformers, and our proofs highlight an important injectivity property satisfied by self-attention. Furthermore, we provide numerical evidence that the minimal number of parameters for memorization is sufficient to train the model to the entropy lower bound.
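To make the empirical setting concrete, the following is a minimal sketch of how next-token distributions and the entropy lower bound could be computed from a finite set of documents. It is an illustration, not the authors' code. It assumes that a context is any non-empty proper prefix of a document and that the entropy lower bound is the smallest achievable average cross-entropy loss, which is reached when the model reproduces the empirical distribution for every context. The function names and the toy corpus are hypothetical.

```python
import math
from collections import Counter, defaultdict


def count_next_tokens(documents):
    """Count, for every distinct context (non-empty proper prefix), the tokens that follow it."""
    counts = defaultdict(Counter)
    for doc in documents:
        for t in range(1, len(doc)):
            context = tuple(doc[:t])  # assumption: contexts are document prefixes
            counts[context][doc[t]] += 1
    return counts


def next_token_distributions(counts):
    """Normalize the counts into one empirical next-token distribution per context."""
    dists = {}
    for context, counter in counts.items():
        total = sum(counter.values())
        dists[context] = {tok: c / total for tok, c in counter.items()}
    return dists


def entropy_lower_bound(counts):
    """Average cross-entropy (in nats) of a model that matches every empirical distribution."""
    total_positions = sum(sum(counter.values()) for counter in counts.values())
    loss = 0.0
    for counter in counts.values():
        n_context = sum(counter.values())
        for c in counter.values():
            loss -= c * math.log(c / n_context)
    return loss / total_positions


# Hypothetical toy corpus of three documents.
docs = [["the", "cat", "sat"], ["the", "cat", "ran"], ["the", "dog", "sat"]]
counts = count_next_tokens(docs)
dists = next_token_distributions(counts)
print(len(dists))                   # number of distinct contexts to interpolate
print(entropy_lower_bound(counts))  # loss floor that no model can go below on this corpus
```

In this sketch, the number of keys in `dists` plays the role of the number of distinct contexts whose distributions a model would need to interpolate, and a training loss equal to `entropy_lower_bound(counts)` would indicate that every one of them is matched exactly.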
