Training Dynamics of Softmax Self-Attention: Fast Global Convergence via Preconditioning

Gautam Goel, Mahdi Soltanolkotabi, Peter Bartlett

TL;DR

We study gradient-descent training of a softmax self-attention layer on a linear regression task and show that a simple first-order optimization algorithm can converge to the globally optimal self-attention parameters at a geometric rate.

Abstract

We study the training dynamics of gradient descent in a softmax self-attention layer trained to perform linear regression and show that a simple first-order optimization algorithm can converge to the globally optimal self-attention parameters at a geometric rate. Our analysis proceeds in two steps. First, we show that in the infinite-data limit the regression problem solved by the self-attention layer is equivalent to a nonconvex matrix factorization problem. Second, we exploit this connection to design a novel "structure-aware" variant of gradient descent which efficiently optimizes the original finite-data regression objective. Our optimization algorithm features several innovations over standard gradient descent, including a preconditioner and regularizer which help avoid spurious stationary points, and a data-dependent spectral initialization of parameters which lie near the manifold of global minima with high probability.
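To make the roles of the preconditioner and the spectral initialization concrete, here is a minimal sketch on a plain low-rank matrix factorization problem, the kind of nonconvex problem the abstract connects to. It is not the paper's algorithm: the preconditioner shown is the standard one from scaled gradient descent for low-rank factorization (each factor's gradient is multiplied by the inverse Gram matrix of the other factor), no regularizer is included, and the function names, dimensions, noise level, and step sizes are all illustrative assumptions.

```python
import numpy as np

def spectral_init(M_obs, r):
    """Balanced rank-r factors from the top-r SVD of an observed matrix."""
    U, s, Vt = np.linalg.svd(M_obs, full_matrices=False)
    root = np.sqrt(s[:r])
    return U[:, :r] * root, Vt[:r].T * root

def factorize(M, L, R, step, iters, precondition):
    """Minimize 0.5 * ||L R^T - M||_F^2 and return the loss at each iteration."""
    losses = []
    for _ in range(iters):
        E = L @ R.T - M                      # residual
        losses.append(0.5 * np.sum(E ** 2))
        grad_L, grad_R = E @ R, E.T @ L      # plain gradients
        if precondition:
            # Assumed preconditioner (scaled-GD style), not taken from the paper:
            # rescale each gradient by the inverse Gram matrix of the other factor.
            grad_L, grad_R = (grad_L @ np.linalg.inv(R.T @ R),
                              grad_R @ np.linalg.inv(L.T @ L))
        L, R = L - step * grad_L, R - step * grad_R
    return losses

# Hypothetical test problem: a rank-3 target with condition number 10.
rng = np.random.default_rng(0)
n, m, r = 20, 15, 3
U, _ = np.linalg.qr(rng.standard_normal((n, r)))
V, _ = np.linalg.qr(rng.standard_normal((m, r)))
M = (U * np.array([10.0, 3.0, 1.0])) @ V.T

# The initialization only sees a noisy copy of M, loosely mimicking finite data.
M_noisy = M + 0.01 * rng.standard_normal((n, m))
L0, R0 = spectral_init(M_noisy, r)

scaled_losses = factorize(M, L0, R0, step=0.5, iters=50, precondition=True)
plain_losses = factorize(M, L0, R0, step=0.05, iters=50, precondition=False)
print("preconditioned final loss:", scaled_losses[-1])
print("plain GD final loss:      ", plain_losses[-1])
```

In this toy setting the plain gradient step must be kept small because of the largest singular value, so progress along the smallest one is slow; the preconditioned update removes that dependence on conditioning, which is the general motivation for preconditioning in factorized problems.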

Paper Structure (28 sections, 28 theorems, 236 equations, 2 figures, 1 algorithm)

Key Result

Theorem 1

The population loss $L(\theta)$ and the regularized population loss $Q(\theta)$ have the following properties:

Figures (2)

  • Figure 1: We consider the linear regression problem where our algorithm uses the spectral initialization as in Algorithm 1, and SGD is initialized randomly, with each parameter being drawn i.i.d. from $\mathcal{N}(0, 1)$.
  • Figure 2: We consider the linear regression problem where both our algorithm and SGD are initialized at the same random point, with each parameter being drawn i.i.d. from $\mathcal{N}(0, 1)$.

Theorems & Definitions (56)

  • Theorem 1
  • proof
  • Lemma 1
  • Theorem 2: A Data-Compute Scaling Law for Softmax Self-Attention
  • Lemma 2: Descent Lemma
  • Lemma 3: Good initialization occurs with high probability
  • Lemma 4
  • Theorem 3: Uniform approximation of expected empirical gradient by population gradient
  • proof
  • proof
  • ...and 46 more