Non-asymptotic Convergence of Training Transformers for Next-token Prediction

Ruiquan Huang; Yingbin Liang; Jing Yang

Non-asymptotic Convergence of Training Transformers for Next-token Prediction

Ruiquan Huang, Yingbin Liang, Jing Yang

TL;DR

The paper tackles the finite-time training dynamics of a one-layer decoder-only Transformer for next-token prediction by introducing a realizable dataset framework based on collocation and query-dependent partial orders. It develops a two-stage normalized gradient descent algorithm that decouples training of the feed-forward and attention components, driving them toward their max-margin solutions and achieving linear convergence of cross-entropy. Theoretical results establish sub-linear directional convergence to $W_{\mathrm{ov}}^*$ and $W_{\mathrm{kq}}^*$ with linear loss decay, augmented by generalization to unseen data via extended partial orders, and are supported by synthetic experiments. Collectively, the work provides non-asymptotic insights into implicit bias and generalization in Transformer training for NTP, with potential implications for understanding finite-time behavior in larger models.

Abstract

Transformers have achieved extraordinary success in modern machine learning due to their excellent ability to handle sequential data, especially in next-token prediction (NTP) tasks. However, the theoretical understanding of their performance in NTP is limited, with existing studies focusing mainly on asymptotic performance. This paper provides a fine-grained non-asymptotic analysis of the training dynamics of a one-layer transformer consisting of a self-attention module followed by a feed-forward layer. We first characterize the essential structural properties of training datasets for NTP using a mathematical framework based on partial orders. Then, we design a two-stage training algorithm, where the pre-processing stage for training the feed-forward layer and the main stage for training the attention layer exhibit fast convergence performance. Specifically, both layers converge sub-linearly to the direction of their corresponding max-margin solutions. We also show that the cross-entropy loss enjoys a linear convergence rate. Furthermore, we show that the trained transformer presents non-trivial prediction ability with dataset shift, which sheds light on the remarkable generalization performance of transformers. Our analysis technique involves the development of novel properties on the attention gradient and further in-depth analysis of how these properties contribute to the convergence of the training process. Our experiments further validate our theoretical findings.

Non-asymptotic Convergence of Training Transformers for Next-token Prediction

TL;DR

and

with linear loss decay, augmented by generalization to unseen data via extended partial orders, and are supported by synthetic experiments. Collectively, the work provides non-asymptotic insights into implicit bias and generalization in Transformer training for NTP, with potential implications for understanding finite-time behavior in larger models.

Abstract

Paper Structure (25 sections, 15 theorems, 106 equations, 2 figures, 1 algorithm)

This paper contains 25 sections, 15 theorems, 106 equations, 2 figures, 1 algorithm.

Introduction
Related Work
Problem Setup
Realizable Training Dataset and Two-Stage Algorithm
Realizable Training Dataset
Training Algorithm
Training Dynamics of the Transformer
Convergence of Training $W_{\mathrm{ov}}$
Convergence of Training $W_{\mathrm{kq}}$
Proof Sketch of \ref{['thm: IB rate']}
Generalization Ability
Experiment
Conclusion
Expression of Gradients
Proof of \ref{['prop: V converge']}
...and 10 more sections

Key Result

Proposition 1

Let $W_{\mathrm{ov}}^*$ be defined in eqn: opt V. Under Assumptions assm: realizable dataset-assm:orthornomal, let $W_{\mathrm{ov}}^{(t)}$ be updated by alg. Then, for any $t\geq 2$, we have $\frac{t \eta_0 }{2\|W_{\mathrm{{o}v}}^*\|} \leq \|W_{\mathrm{{o}v}}^{(t)}\| \leq t\eta_0$ and the following Moreover, the loss function $\mathcal{L}_0$ satisfies that $\mathcal{L}_0(W_{\mathrm{ov}}^{(t)})\le

Figures (2)

Figure 1: The left plot shows the mapping from sentence to the next token. The red rectangle indicates the optimal token in the corresponding sentence. The right plot shows the collocation relationship.
Figure 2: Training dynamics of single-layer transformer for NTP.

Theorems & Definitions (18)

Definition 1: $x^q$-partial order
Example 1
Proposition 1
Theorem 1
Theorem 2: Loss Convergence
Proposition 2
Theorem 3
Example 2: Generalization to unseen data in Example 1
Lemma 1
Lemma 2
...and 8 more

Non-asymptotic Convergence of Training Transformers for Next-token Prediction

TL;DR

Abstract

Non-asymptotic Convergence of Training Transformers for Next-token Prediction

Authors

TL;DR

Abstract

Table of Contents

Key Result

Figures (2)

Theorems & Definitions (18)