Why are Adaptive Methods Good for Attention Models?

Jingzhao Zhang, Sai Praneeth Karimireddy, Andreas Veit, Seungyeon Kim, Sashank J Reddi, Sanjiv Kumar, Suvrit Sra

TL;DR

The paper investigates why adaptive methods such as Adam outperform SGD on attention models, and traces the gap to heavy-tailed stochastic gradient noise, for which clipping is a principled stabilization technique. It introduces GClip and the adaptive coordinate-wise clipping algorithm ACClip, provides convergence bounds under heavy-tailed noise with $\alpha \in (1,2]$, and establishes lower bounds showing that these rates are optimal. By combining online moment estimation with coordinate-wise clipping, ACClip converges faster than Adam and outperforms it on BERT pretraining and fine-tuning. The findings show how the structure of gradient noise influences optimizer performance and offer a practical, theoretically grounded approach to robust deep learning optimization.

Abstract

While stochastic gradient descent (SGD) is still the de facto algorithm in deep learning, adaptive methods like Clipped SGD/Adam have been observed to outperform SGD across important tasks, such as attention models. The settings under which SGD performs poorly in comparison to adaptive methods are not well understood yet. In this paper, we provide empirical and theoretical evidence that a heavy-tailed distribution of the noise in stochastic gradients is one cause of SGD's poor performance. We provide the first tight upper and lower convergence bounds for adaptive gradient methods under heavy-tailed noise. Further, we demonstrate how gradient clipping plays a key role in addressing heavy-tailed gradient noise. Subsequently, we show how clipping can be applied in practice by developing an adaptive coordinate-wise clipping algorithm (ACClip) and demonstrate its superior performance on BERT pretraining and finetuning tasks.
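
As a rough illustration of what adaptive coordinate-wise clipping could look like in code, the Python sketch below keeps a per-coordinate running estimate of the gradient's $\alpha$-th moment and uses it as a clipping threshold for a momentum buffer. The function name `acclip_like_step`, the decay rates `beta1` and `beta2`, the default `alpha=1.5`, and the exact form of the update are assumptions made for this sketch. The abstract does not spell out ACClip's update rule, so this should not be read as the authors' algorithm.

```python
def acclip_like_step(x, grad, m, tau_pow, lr=0.1,
                     beta1=0.9, beta2=0.99, alpha=1.5, eps=1e-8):
    """One illustrative update on plain Python lists (hypothetical rule).

    x       -- current parameters
    grad    -- stochastic gradient at x
    m       -- momentum buffer (same length as x)
    tau_pow -- running estimate of |g_i|**alpha for each coordinate
    """
    new_x, new_m, new_tau_pow = [], [], []
    for xi, gi, mi, ti in zip(x, grad, m, tau_pow):
        # Exponential moving average of the gradient (momentum).
        mi = beta1 * mi + (1.0 - beta1) * gi
        # Online estimate of the alpha-th absolute moment of this coordinate.
        ti = beta2 * ti + (1.0 - beta2) * abs(gi) ** alpha
        # Per-coordinate clipping threshold derived from that estimate.
        tau = ti ** (1.0 / alpha)
        # Shrink the momentum only when its magnitude exceeds the threshold.
        scale = min(tau / (abs(mi) + eps), 1.0)
        new_x.append(xi - lr * scale * mi)
        new_m.append(mi)
        new_tau_pow.append(ti)
    return new_x, new_m, new_tau_pow


# One step with an ordinary coordinate and an outlier coordinate.
x, m, tau_pow = acclip_like_step([1.0, 1.0], [1.0, 1000.0], [0.0, 0.0], [0.0, 0.0])
```

In this sketch each coordinate moves by at most `lr` times its own threshold, so a single extreme gradient entry cannot dominate the step for the other coordinates.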

Paper Structure

This paper contains 32 sections, 16 theorems, 84 equations, 7 figures, 3 tables, 1 algorithm.

Key Result

Theorem 2

Suppose that $f$ is $L$-smooth and that the stochastic gradients satisfy the $\alpha$-moment assumption (assump:alpha-moment) for $\alpha \in (1,2]$. Let $\{x_k\}$ be the iterates of GClip with parameters $\eta_k = \eta = \min\{\frac{1}{4L}, \frac{\sigma^\alpha}{L\tau^\alpha} , \frac{1}{24L\tau}\}$ and $\tau_k = \tau = \
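
To make the setting of this theorem more concrete, here is a minimal Python sketch of global-norm clipping run with a constant step size $\eta$ and a constant threshold $\tau$. The toy objective $f(x) = \frac{1}{2}\|x\|^2$, the symmetric Pareto noise, and the helper names `clip_global`, `heavy_tailed_noise`, and `gclip_run` are all assumptions chosen for illustration. The values of `eta` and `tau` are arbitrary and are not the ones prescribed by the theorem.

```python
import math
import random


def clip_global(g, tau):
    """Rescale g so that its Euclidean norm is at most tau."""
    norm = math.sqrt(sum(gi * gi for gi in g))
    scale = 1.0 if norm <= tau else tau / norm
    return [scale * gi for gi in g]


def heavy_tailed_noise(rng, dim, tail_index=1.5):
    """Symmetric Pareto-type noise (assumed noise model for this sketch).

    With tail_index < 2 the variance is infinite, while moments of
    order strictly below tail_index remain finite.
    """
    return [rng.choice((-1.0, 1.0)) * (rng.paretovariate(tail_index) - 1.0)
            for _ in range(dim)]


def gclip_run(x0, eta, tau, steps, seed=0):
    """Clipped SGD on f(x) = 0.5 * ||x||^2 with constant eta and tau."""
    rng = random.Random(seed)
    x = list(x0)
    for _ in range(steps):
        noise = heavy_tailed_noise(rng, len(x))
        # The exact gradient of the toy objective is x itself; add noise.
        g = [xi + ni for xi, ni in zip(x, noise)]
        step = clip_global(g, tau)
        x = [xi - eta * si for xi, si in zip(x, step)]
    return x


x_final = gclip_run([5.0, -3.0], eta=0.05, tau=2.0, steps=200)
```

Because every clipped gradient has norm at most `tau`, each iterate moves by at most `eta * tau`, regardless of how extreme an individual noise sample is.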

Figures (7)

  • Figure 1: (a) Validation loss for ResNet50 trained on ImageNet. SGD momentum outperforms Adam. (b) Histogram of sampled gradient noise for ResNet50 on the ImageNet dataset. (c) Histogram of samples from a sum of squared Gaussians. (d) Estimated variance of the stochastic gradient for ResNet50. (e) Validation loss for BERT pretraining. Although the hyperparameters for SGD are fine-tuned, a large performance gap is still observed between SGD and Adam. (f) Histogram of sampled gradient noise for BERT on the Wikipedia+Books dataset. (g) Histogram of samples from a sum of squared $\alpha$-stable random variables. (h) Estimated variance of the stochastic gradient for the BERT model.
  • Figure 2: The distribution of gradient noise is non-stationary during BERT training, while it remains almost unchanged for ResNet training on ImageNet.
  • Figure 3: (a) Performance of different algorithms for training a toy Transformer-XL model described in the Transformer-XL section (sec:transformerxl). (b) Train and (c) validation loss for BERT$_{base}$ pretraining with a sequence length of 128. While a large gap remains between non-adaptive and adaptive methods, clipped SGD momentum converges faster than standard SGD momentum. The proposed algorithm for adaptive coordinate-wise clipping (ACClip) achieves a lower loss than Adam.
  • Figure 4: Distribution of gradient noise norm in Attention and ResNet models on two data sources: Wikipedia and synthetic Gaussian. The heavy-tailed noise pattern results from the interaction of the model architecture and the data distribution.
  • Figure 5: (a) Noise histogram of AlexNet on ImageNet data at initialization. (b) Noise histogram of AlexNet on ImageNet data at 5k iterations. (c) The per-dimension noise distribution within a single minibatch at initialization.
  • ...and 2 more figures

Theorems & Definitions (28)

  • Remark 1: Nonconvergence of SGD
  • proof
  • Theorem 2: Non-convex convergence
  • Remark 3
  • Theorem 4: Strongly-convex convergence
  • Theorem 5
  • Theorem 6
  • Corollary 7: GClip under coordinate-wise noise
  • Theorem 8: CClip under coordinate-wise noise
  • Lemma 9
  • ...and 18 more