Why Transformers Need Adam: A Hessian Perspective

Yushun Zhang, Congliang Chen, Tian Ding, Ziniu Li, Ruoyu Sun, Zhi-Quan Luo

TL;DR

This work explains, through the lens of the Hessian, why SGD performs worse than Adam on Transformers. It finds that SGD can perform on par with Adam on problems without block heterogeneity, but performs worse than Adam when block heterogeneity exists.

Abstract

SGD performs worse than Adam by a significant margin on Transformers, but the reason remains unclear. In this work, we provide an explanation through the lens of the Hessian: (i) Transformers are "heterogeneous": the Hessian spectrum varies dramatically across parameter blocks, a phenomenon we call "block heterogeneity"; (ii) Heterogeneity hampers SGD: SGD performs worse than Adam on problems with block heterogeneity. To validate (i) and (ii), we check various Transformers, CNNs, MLPs, and quadratic problems, and find that SGD can perform on par with Adam on problems without block heterogeneity, but performs worse than Adam when the heterogeneity exists. Our initial theoretical analysis indicates that SGD performs worse because it applies a single learning rate to all blocks, which cannot handle the heterogeneity among blocks. This limitation could be ameliorated if we use coordinate-wise learning rates, as designed in Adam.
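The claim that a single learning rate cannot handle heterogeneous blocks can be illustrated on a toy quadratic. The following is a minimal sketch, not the paper's experiment: the eigenvalues, the step count, and the helper `run_gd` are all hypothetical, and the per-block step size `1 / lambda_max(block)` is an idealized stand-in for an adaptive method rather than Adam itself.

```python
import numpy as np

def run_gd(eigs, lr, steps=100):
    """Gradient descent on L(w) = 0.5 * sum(eigs * w**2), starting from w = 1.

    `lr` is either a scalar (one rate shared by all blocks) or an array
    with one rate per coordinate. Returns the final loss (the optimum is 0).
    """
    w = np.ones_like(eigs)
    for _ in range(steps):
        grad = eigs * w          # gradient of the diagonal quadratic
        w = w - lr * grad
    return 0.5 * np.sum(eigs * w**2)

# Three hypothetical blocks whose Hessian eigenvalues live on very different scales.
blocks = [np.array([1.0, 2.0]), np.array([100.0, 200.0]), np.array([1e4, 2e4])]
eigs = np.concatenate(blocks)

# One learning rate for everything: it must be small enough for the largest eigenvalue.
single_lr = 1.0 / eigs.max()

# One learning rate per block: each block is scaled by its own largest eigenvalue.
blockwise_lr = np.concatenate([np.full(b.shape, 1.0 / b.max()) for b in blocks])

loss_single = run_gd(eigs, single_lr)
loss_blockwise = run_gd(eigs, blockwise_lr)
print(f"single lr:    {loss_single:.3e}")
print(f"blockwise lr: {loss_blockwise:.3e}")
```

With the shared rate, the small-eigenvalue blocks barely move because the step size is dictated by the largest block. The blockwise rates remove that coupling, which is the intuition behind the coordinate-wise rates mentioned in the abstract.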

Paper Structure

This paper contains 47 sections, 6 theorems, 17 equations, 16 figures, 3 tables, 5 algorithms.

Key Result

Proposition 1

(Lower bound for GD.) Consider $\min _w \mathcal{L}(w) =\frac{1}{2} w^T H w- h^T w$ where $H \in \mathbb{R}^{d \times d}$ is positive definite and $h \in \mathbb{R}^{d}$. Let $w_{GD}^t$ be the output of GD after $t$ steps. There exist a block diagonal matrix $H$, a vector $h$, and an initial point $w^0$ s.t. [...], where $\kappa$ is the condition number of $H$.
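Because Proposition 1 is phrased in terms of the condition number $\kappa$ of a block diagonal $H$, it may help to see how $\kappa$ behaves for such matrices. The sketch below is an illustration under assumed values: the block sizes, scales, and the helpers `random_pd_block`, `block_diag`, and `cond` are hypothetical and do not come from the paper. It relies only on the fact that the eigenvalues of a block diagonal matrix are the union of the eigenvalues of its blocks.

```python
import numpy as np

def random_pd_block(rng, size, scale):
    """Random symmetric positive definite block with eigenvalues >= scale."""
    a = rng.standard_normal((size, size))
    return scale * (a @ a.T / size + np.eye(size))

def block_diag(blocks):
    """Assemble square blocks into one block diagonal matrix."""
    d = sum(b.shape[0] for b in blocks)
    H = np.zeros((d, d))
    i = 0
    for b in blocks:
        k = b.shape[0]
        H[i:i + k, i:i + k] = b
        i += k
    return H

def cond(M):
    """Condition number of a symmetric positive definite matrix."""
    ev = np.linalg.eigvalsh(M)   # eigenvalues in ascending order
    return ev[-1] / ev[0]

rng = np.random.default_rng(0)
# Hypothetical blocks on scales 1, 1e2, and 1e4 to mimic heterogeneity.
blocks = [random_pd_block(rng, 3, s) for s in (1.0, 1e2, 1e4)]
H = block_diag(blocks)

kappa_blocks = [cond(b) for b in blocks]
kappa_full = cond(H)
print("per-block condition numbers:", np.round(kappa_blocks, 2))
print("condition number of full H: ", round(kappa_full, 2))
```

Each block is well conditioned on its own, yet the full matrix is not, because its largest and smallest eigenvalues come from different blocks. Any bound for GD that depends on the $\kappa$ of the full $H$ therefore inherits the gap between blocks.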

Figures (16)

  • Figure 1: The full Hessian spectra of CNNs (VGG16 and ResNet18) and Transformers (GPT2, GPT2-nano, and ViT-base) at different training stages. The $x$-axis records the eigenvalues and the $y$-axis records the frequency in the log scale. To allow comparison in the same figure, the plotted spectra are normalized by their 10th largest eigenvalues. We find that the spectra on CNNs and Transformers are largely similar.
  • Figure 2: (a): The Hessian of an MLP after 1 training step, as reported in Collobert (2004). (b,c,d): We calculate the Hessians of an MLP (with 8 neurons) at different training stages. We find that the near-block-diagonal structure is maintained throughout training.
  • Figure 3: (a) (c): The blockwise Hessian spectra of VGG16 (CNN) and BERT (Transformer) at initialization. The $x$-axis records the eigenvalues and the $y$-axis records the frequency in the log scale. To allow comparison in the same figure, we sample 4 blocks in each model. The plotted spectra are normalized by their 10th largest eigenvalues. The spectra are similar among blocks for VGG16 and differ significantly across blocks for BERT. (b) (d): Adam vs. SGD for training VGG16 and BERT.
  • Figure 4: The JS distance among blockwise Hessian spectra at initialization. We find that the JS distance of blockwise spectra in CNNs is significantly smaller than that in Transformers.
  • Figure 5: (a) SGD vs. Adam on a man-made MLP with different degrees of heterogeneity $c$. Each point records the best converged test accuracy under a learning rate grid search. SGD performs worse as heterogeneity grows. (b) The JS distance among blockwise Hessian spectra for MLP-Mixer (Tolstikhin et al., 2021) at initialization. We observe heterogeneity. (c) SGD performs worse than Adam on MLP-Mixer.
  • ...and 11 more figures
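
Figures 4 and 5 compare blocks through the JS distance between their Hessian spectra. A minimal sketch of such a comparison follows, under stated assumptions: the JS distance is taken as the square root of the Jensen-Shannon divergence between histograms, spectra are normalized by their largest eigenvalue and binned on a shared log10 grid, and the eigenvalues are synthetic. The helpers `spectrum_hist` and `js_distance` are hypothetical, and the paper's exact normalization and spectrum estimation may differ.

```python
import numpy as np

# Shared log10 bin edges so that all histograms are directly comparable.
EDGES = np.linspace(-6.0, 0.0, 31)

def spectrum_hist(eigs):
    """Histogram of log10(|eig| / max|eig|), clipped below at 1e-6."""
    mags = np.abs(np.asarray(eigs, dtype=float))
    normalized = np.clip(mags / mags.max(), 1e-6, None)
    counts, _ = np.histogram(np.log10(normalized), bins=EDGES)
    return counts.astype(float)

def js_distance(p, q):
    """Square root of the Jensen-Shannon divergence (natural log)."""
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0             # terms with a == 0 contribute nothing
        return np.sum(a[mask] * np.log(a[mask] / b[mask]))

    js = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    return np.sqrt(max(js, 0.0))  # guard against tiny negative round-off

# Synthetic "blockwise spectra": a and b share a shape up to scale, c is much wider.
block_a = np.logspace(-1, 0, 200)
block_b = 5.0 * np.logspace(-1, 0, 200)
block_c = np.logspace(-5, 0, 200)

hist_a, hist_b, hist_c = map(spectrum_hist, (block_a, block_b, block_c))
d_same = js_distance(hist_a, hist_b)
d_diff = js_distance(hist_a, hist_c)
print(f"similar blocks:   {d_same:.3f}")
print(f"different blocks: {d_diff:.3f}")
```

Normalizing before binning makes the distance insensitive to the overall scale of a block, so it responds to the shape of the spectrum. With the natural log, the distance is bounded above by $\sqrt{\ln 2}$.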

Theorems & Definitions (6)

  • Proposition 1
  • Theorem 1
  • Proposition 2
  • Theorem 2
  • Theorem 3
  • Theorem 4