Taming Transformer Without Using Learning Rate Warmup

Xianbiao Qi, Yelin He, Jiaquan Ye, Chun-Guang Li, Bojia Zi, Xili Dai, Qin Zou, Rong Xiao

TL;DR

The paper addresses the challenge of training Transformer models without learning rate warmup by identifying spectral energy concentration in $({\boldsymbol{W}_q}^{\top}{\boldsymbol{W}_k})$ as a key cause of malignant entropy collapse in attention. Through a matrix-calculus analysis, it shows how gradients propagate via the self-attention Jacobian and how large top singular values can trigger instability. Motivated by Weyl's inequality, the authors propose AdamW$^2$, which bounds the learning rate with $\alpha_t \le \tau \frac{\sigma_1({\boldsymbol{W}_{t-1}})}{\sigma_1(\nabla {\boldsymbol{W}}_t)}$ and uses fast power iterations to estimate $\sigma_1$, enabling stable training without warmup. Empirical results across ViT, Swin, GPT, Flatten-Swin, and large-scale models (ViT-g, nanoGPT-large) demonstrate competitive performance without warmup, highlighting the method's practicality and robustness for scaling Transformer training.

Abstract

Scaling Transformers to a large size without technical tricks such as learning rate warmup or a markedly lower learning rate is extremely challenging, and is attracting increasing attention. In this paper, we provide a theoretical analysis of the Transformer training process and reveal the rationale behind the model crash phenomenon, termed spectral energy concentration of ${\boldsymbol{W}_q}^{\top} \boldsymbol{W}_k$, which is the cause of a malignant entropy collapse, where $\boldsymbol{W}_q$ and $\boldsymbol{W}_k$ are the projection matrices for the query and the key in the Transformer, respectively. To remedy this problem, motivated by Weyl's inequality, we present a novel optimization strategy, i.e., making the weight updates in successive steps smooth: if the ratio $\frac{\sigma_{1}(\nabla \boldsymbol{W}_t)}{\sigma_{1}(\boldsymbol{W}_{t-1})}$ is larger than a threshold, we automatically bound the learning rate to a weighted multiple of $\frac{\sigma_{1}(\boldsymbol{W}_{t-1})}{\sigma_{1}(\nabla \boldsymbol{W}_t)}$, where $\nabla \boldsymbol{W}_t$ is the update quantity in step $t$. Such an optimization strategy prevents spectral energy from concentrating in only a few directions, and thus avoids the malignant entropy collapse that triggers the model crash. We conduct extensive experiments using ViT, Swin-Transformer and GPT, showing that our optimization strategy can effectively and stably train these Transformers without using learning rate warmup.
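The learning-rate rule described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `top_singular_value` and `bounded_learning_rate`, the iteration count, and the default value of `tau` are hypothetical choices made for illustration, and the clamp is written as a simple `min` under the assumption that the rule amounts to enforcing $\alpha_t \le \tau \frac{\sigma_1(\boldsymbol{W}_{t-1})}{\sigma_1(\nabla \boldsymbol{W}_t)}$.

```python
import numpy as np


def top_singular_value(mat, num_iters=50, seed=0):
    """Estimate sigma_1(mat) by power iteration on mat^T mat."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(mat.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(num_iters):
        v = mat.T @ (mat @ v)          # one step of power iteration
        norm = np.linalg.norm(v)
        if norm == 0.0:                # zero matrix: sigma_1 is 0
            return 0.0
        v /= norm
    return float(np.linalg.norm(mat @ v))


def bounded_learning_rate(base_lr, weight, update, tau=0.01):
    """Clamp base_lr so that lr <= tau * sigma_1(weight) / sigma_1(update).

    `tau` is an illustrative value, not one taken from the paper.
    """
    sigma_w = top_singular_value(weight)
    sigma_u = top_singular_value(update)
    if sigma_u == 0.0:                 # no update, nothing to bound
        return base_lr
    return min(base_lr, tau * sigma_w / sigma_u)


# Usage: a large update relative to the weights forces a smaller step.
rng = np.random.default_rng(1)
W_prev = rng.standard_normal((64, 64))
update = 100.0 * rng.standard_normal((64, 64))
lr = bounded_learning_rate(1e-3, W_prev, update)
```

Power iteration only needs matrix-vector products, which is why it is cheaper than a full SVD when only $\sigma_1$ is required.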

Paper Structure

This paper contains 29 sections, 7 theorems, 36 equations, 15 figures, 2 tables, 1 algorithm.

Key Result

Proposition 1

Let $\boldsymbol{P} = {\boldsymbol{X}^{\top} {\boldsymbol{W}_q}^{\top} {\boldsymbol{W}_k} {\boldsymbol{X}} }$, where $\boldsymbol{X} \in \mathcal{R}^{d\times n}$, $\boldsymbol{W}_q \in \mathcal{R}^{d_q\times d}$, and $\boldsymbol{W}_k \in \mathcal{R}^{d_q\times d}$. According to vectorization and matrix calculus, [...] where $\otimes$ denotes the Kronecker product, $\boldsymbol{I}_n \in \mathcal{R}^{n\times n}$ denotes a [...]

Figures (15)

  • Figure 1: Training dynamics of a failed ViT. This figure shows how the values of the 15 terms in Equation \ref{eq:15_terms} change as the number of training steps increases. Please pay particular attention to subfigures (a)-(e).
  • Figure 2: Visualization of the attention-map dynamics at different training steps for a successful and a crashed ViT-Base model, respectively. Please click the images to play the animation. Best viewed with Acrobat Reader.
  • Figure 3: Three attention modes. The left panel shows a normal attention map. The middle panel shows a classical attention map when the model crashes, for which the entropy is almost 0. The right panel shows an attention map from a normal model training while its entropy is almost 0.
  • Figure 4: Comparison of the spectral energy concentration index between a successfully trained model and a crashed model. The figure shows the results for three different blocks. In the successful case, the spectral energy is distributed across all directions. In the crashed model, it concentrates in only a few directions.
  • Figure 5: Attribution flow chart of attention collapse.
  • ...and 10 more figures
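
The "entropy is almost 0" statements in the Figure 3 caption can be made concrete with a small sketch. It assumes that attention entropy means the Shannon entropy of each row of a row-stochastic attention map, averaged over rows; the paper's exact definition may differ, and the function name `attention_entropy` is hypothetical.

```python
import numpy as np


def attention_entropy(attn, eps=1e-12):
    """Mean Shannon entropy (in nats) over the rows of an attention map.

    Each row of `attn` is assumed to be a probability distribution.
    """
    attn = np.asarray(attn, dtype=float)
    safe = np.clip(attn, eps, 1.0)     # avoid log(0); 0 * log(eps) is still 0
    row_entropy = -np.sum(attn * np.log(safe), axis=-1)
    return float(row_entropy.mean())


n = 4
uniform = np.full((n, n), 1.0 / n)     # every token attends equally to all tokens
one_hot = np.eye(n)                    # every token attends to a single token

high = attention_entropy(uniform)      # equals log(n), the maximum for n tokens
low = attention_entropy(one_hot)       # equals 0, the minimum
```

As the Figure 3 caption notes, near-zero entropy appears both in a crashed model and in a normally trained one, so this scalar alone does not identify a crash.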

Theorems & Definitions (8)

  • Proposition 1: Matrix Calculus for Self-Attention
  • Theorem 1: Spectral Energy Concentration leads to malignant entropy collapse
  • Theorem 2: Weyl's Inequality on Singular Values.
  • Definition 1: Kronecker Product
  • Proposition 2: Property of Vectorization for Matrix Product
  • Proposition 3: Expectation of ${\boldsymbol{x}_i}^{\top} \boldsymbol{W} \boldsymbol{x}_i$
  • Proposition 4: Expectation of ${\boldsymbol{x}_i}^{\top} \boldsymbol{W} \boldsymbol{x}_j$ for $i \neq j$
  • Theorem 3: Courant-Fischer Min-max Principle for Singular Values
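
The link between Weyl's inequality (Theorem 2 in the list above) and the learning-rate bound can be sketched in a few lines. This is a reconstruction under stated assumptions, not the paper's own derivation: it uses the commonly stated form of Weyl's inequality for the largest singular value, and it assumes the update has the form $\boldsymbol{W}_t = \boldsymbol{W}_{t-1} - \alpha_t \nabla \boldsymbol{W}_t$.

$$
\sigma_1(\boldsymbol{A} + \boldsymbol{B}) \le \sigma_1(\boldsymbol{A}) + \sigma_1(\boldsymbol{B})
$$

Setting $\boldsymbol{A} = \boldsymbol{W}_{t-1}$ and $\boldsymbol{B} = -\alpha_t \nabla \boldsymbol{W}_t$ with $\alpha_t \ge 0$ gives

$$
\sigma_1(\boldsymbol{W}_t) \le \sigma_1(\boldsymbol{W}_{t-1}) + \alpha_t \, \sigma_1(\nabla \boldsymbol{W}_t).
$$

If the learning rate satisfies $\alpha_t \le \tau \frac{\sigma_1(\boldsymbol{W}_{t-1})}{\sigma_1(\nabla \boldsymbol{W}_t)}$, the second term is at most $\tau \, \sigma_1(\boldsymbol{W}_{t-1})$, so

$$
\sigma_1(\boldsymbol{W}_t) \le (1 + \tau) \, \sigma_1(\boldsymbol{W}_{t-1}).
$$

Under these assumptions, the top singular value can grow by at most a factor of $1 + \tau$ per step, which is one way to read the "smooth" successive updates described in the abstract.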