Training Dynamics of Softmax Self-Attention: Fast Global Convergence via Preconditioning

Gautam Goel; Mahdi Soltanolkotabi; Peter Bartlett

TL;DR

This paper studies the training dynamics of gradient descent in a softmax self-attention layer trained to perform linear regression, and shows that a simple first-order optimization algorithm can converge to the globally optimal self-attention parameters at a geometric rate.

Abstract

We study the training dynamics of gradient descent in a softmax self-attention layer trained to perform linear regression and show that a simple first-order optimization algorithm can converge to the globally optimal self-attention parameters at a geometric rate. Our analysis proceeds in two steps. First, we show that in the infinite-data limit the regression problem solved by the self-attention layer is equivalent to a nonconvex matrix factorization problem. Second, we exploit this connection to design a novel "structure-aware" variant of gradient descent which efficiently optimizes the original finite-data regression objective. Our optimization algorithm features several innovations over standard gradient descent, including a preconditioner and a regularizer that help avoid spurious stationary points, and a data-dependent spectral initialization that places the parameters near the manifold of global minima with high probability.
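To make the matrix-factorization connection concrete, the following is a minimal sketch of preconditioned gradient descent with a spectral initialization on a generic low-rank factorization objective, 0.5 * ||X Y^T - M||_F^2. It is not the paper's algorithm: the objective, the Gram-matrix preconditioner, the function names `spectral_init` and `preconditioned_gd`, and all step sizes and dimensions are assumptions chosen for illustration, and the regularizer mentioned in the abstract is omitted.

```python
import numpy as np

def spectral_init(M_obs, r):
    """Hypothetical helper: top-r SVD of an observed matrix, split evenly into two factors."""
    U, s, Vt = np.linalg.svd(M_obs, full_matrices=False)
    root = np.sqrt(s[:r])
    return U[:, :r] * root, Vt[:r].T * root

def preconditioned_gd(M, X, Y, steps=50, lr=0.5, eps=1e-8):
    """Minimize 0.5 * ||X Y^T - M||_F^2, rescaling each gradient by an inverse Gram matrix."""
    damp = eps * np.eye(X.shape[1])  # small damping keeps the Gram matrices invertible
    for _ in range(steps):
        R = X @ Y.T - M              # residual
        gX, gY = R @ Y, R.T @ X      # plain gradients with respect to X and Y
        # Preconditioning: right-multiply each gradient by the inverse Gram
        # matrix of the opposite factor (assumed form, for illustration only).
        X_new = X - lr * gX @ np.linalg.inv(Y.T @ Y + damp)
        Y_new = Y - lr * gY @ np.linalg.inv(X.T @ X + damp)
        X, Y = X_new, Y_new
    return X, Y

rng = np.random.default_rng(0)
M = rng.standard_normal((20, 3)) @ rng.standard_normal((3, 15))  # rank-3 target
M_noisy = M + 0.01 * rng.standard_normal(M.shape)                # stand-in for finite-data noise
X0, Y0 = spectral_init(M_noisy, r=3)
X, Y = preconditioned_gd(M, X0, Y0)

init_err = np.linalg.norm(X0 @ Y0.T - M) / np.linalg.norm(M)
final_err = np.linalg.norm(X @ Y.T - M) / np.linalg.norm(M)
print(f"relative error: init {init_err:.2e}, final {final_err:.2e}")
```

The initialization is computed from a noisy copy of the target to mimic a data-dependent start that lands near, but not on, the set of global minima. The preconditioner is intended to make the local contraction rate insensitive to the conditioning of the factors, which is the usual motivation for this kind of rescaling.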

Paper Structure (28 sections, 28 theorems, 236 equations, 2 figures, 1 algorithm)

Introduction
Main Contributions
Related Work
Model
Gradient oracle model.
Assumptions.
Notation
Structure of the population loss
Main Result
Experiments
Proofs of Lemmas
Useful facts.
Proof of Lemma \ref{theta-star-lemma}
Proof of Lemma \ref{descent-lemma}
Proof of Lemma \ref{good-events-lemma}
...and 13 more sections

Key Result

Theorem 1

The population loss $L(\theta)$ and the regularized population loss $Q(\theta)$ have the following properties:

Figures (2)

Figure 1: We consider the linear regression problem where our algorithm uses the spectral initialization as in Algorithm \ref{alg:gd}, and SGD is initialized randomly, with each parameter being drawn i.i.d. from $\mathcal{N}(0, 1)$.
Figure 2: We consider the linear regression problem where both our algorithm and SGD are initialized at the same random point, with each parameter being drawn i.i.d. from $\mathcal{N}(0, 1)$.

Theorems & Definitions (56)

Theorem 1
proof
Lemma 1
Theorem 2: A Data-Compute Scaling Law for Softmax Self-Attention
Lemma 2: Descent Lemma
Lemma 3: Good initialization occurs with high probability
Lemma 4
Theorem 3: Uniform approximation of expected empirical gradient by population gradient
proof
proof
...and 46 more
