Toward Understanding Why Adam Converges Faster Than SGD for Transformers

Yan Pan, Yuanzhi Li

TL;DR

The paper investigates why Adam often converges faster than SGD for transformer training by introducing directional sharpness, the curvature of the objective along an update direction defined as $v^\top \nabla^2 f(x) v$. It shows SGD exhibits much higher directional sharpness than adaptive methods, especially due to imbalanced gradient-Hessian coordinates, which limits feasible step sizes. Coordinate-wise clipping is proposed as a simple, universal technique to reduce directional sharpness and accelerate convergence across various optimizers, with theoretical intuition and empirical demonstrations on machine translation and autoregressive modeling. The findings suggest that the adaptive, per-coordinate scaling in Adam helps to control sharpness, and clipping can further speed up training by flattening update directions, guiding future algorithm design with local-geometry diagnostics.
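To make the definition concrete, recall the second-order expansion $f(x - \eta v) \approx f(x) - \eta \nabla f(x)^\top v + \tfrac{\eta^2}{2} v^\top \nabla^2 f(x) v$: the quadratic term grows with the directional sharpness, so a sharper direction tolerates a smaller step size $\eta$. The following is a minimal sketch on a toy quadratic, not the paper's experimental setup. The diagonal Hessian `hess_diag`, the unit normalization of $v$, and the sign-based direction used as a crude stand-in for per-coordinate adaptive scaling are all assumptions made for illustration.

```python
import math

def directional_sharpness(hess_diag, v):
    """Return u^T H u, where u = v / ||v||_2 and H = diag(hess_diag)."""
    norm = math.sqrt(sum(vi * vi for vi in v))
    u = [vi / norm for vi in v]
    return sum(h * ui * ui for h, ui in zip(hess_diag, u))

# Toy objective f(x) = 0.5 * sum_i h_i * x_i^2 with one sharp coordinate (assumed).
hess_diag = [100.0, 1.0, 1.0, 1.0]
x = [1.0, 1.0, 1.0, 1.0]
grad = [h * xi for h, xi in zip(hess_diag, x)]

# Gradient direction, as in plain (noise-free) SGD.
sharp_sgd = directional_sharpness(hess_diag, grad)

# Sign direction: a rough stand-in for per-coordinate adaptive scaling (assumption).
sign_dir = [math.copysign(1.0, g) for g in grad]
sharp_sign = directional_sharpness(hess_diag, sign_dir)

print(f"gradient direction: {sharp_sgd:.2f}")
print(f"sign direction:     {sharp_sign:.2f}")
```

In this toy case the gradient direction concentrates almost entirely on the single sharp coordinate, so its directional sharpness is close to the largest Hessian entry, while the sign direction spreads its weight evenly and is noticeably flatter.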

Abstract

While stochastic gradient descent (SGD) is still the most popular optimization algorithm in deep learning, adaptive algorithms such as Adam have established empirical advantages over SGD in some deep learning applications, such as training transformers. However, it remains an open question why Adam converges significantly faster than SGD in these scenarios. In this paper, we propose one explanation of why Adam converges faster than SGD using a new concept, directional sharpness. We argue that the performance of optimization algorithms is closely related to the directional sharpness of the update steps, and show that SGD has much worse directional sharpness than adaptive algorithms. We further observe that only a small fraction of the coordinates causes the bad sharpness and slow convergence of SGD, and propose coordinate-wise clipping as a remedy for SGD and other optimization algorithms. We demonstrate the effect of coordinate-wise clipping on reducing sharpness and speeding up the convergence of optimization algorithms under various settings. We show that coordinate-wise clipping improves the local loss reduction when only a small fraction of the coordinates has bad sharpness. We conclude that the sharpness reduction effect of adaptive coordinate-wise scaling is the reason for Adam's success in practice and suggest the use of coordinate-wise clipping as a universal technique to speed up deep learning optimization.
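The effect of coordinate-wise clipping on local loss reduction can be illustrated with a small sketch. It assumes a toy quadratic with one sharp coordinate and a fixed clipping threshold; the names `clip_coordinates`, `lr`, and `tau` and their values are hypothetical and are not taken from the paper, which may select the threshold differently.

```python
def clip_coordinates(v, tau):
    """Clip each coordinate of v to the interval [-tau, tau]."""
    return [max(-tau, min(tau, vi)) for vi in v]

def loss(hess_diag, x):
    """Toy objective f(x) = 0.5 * sum_i h_i * x_i^2."""
    return 0.5 * sum(h * xi * xi for h, xi in zip(hess_diag, x))

def step(x, direction, lr):
    """Take one descent step x - lr * direction."""
    return [xi - lr * di for xi, di in zip(x, direction)]

# One sharp coordinate among otherwise flat ones (assumed for illustration).
hess_diag = [100.0, 1.0, 1.0, 1.0]
x = [1.0, 1.0, 1.0, 1.0]
grad = [h * xi for h, xi in zip(hess_diag, x)]

lr = 0.5   # hypothetical step size, too large for the sharp coordinate
tau = 1.0  # hypothetical clipping threshold

loss_start = loss(hess_diag, x)
loss_plain = loss(hess_diag, step(x, grad, lr))
loss_clipped = loss(hess_diag, step(x, clip_coordinates(grad, tau), lr))

print(f"start:            {loss_start:.3f}")
print(f"unclipped update: {loss_plain:.3f}")
print(f"clipped update:   {loss_clipped:.3f}")
```

At this step size the unclipped update overshoots along the sharp coordinate and the loss increases, whereas the clipped update caps that one coordinate and still makes progress on the others. This mirrors, in a very simplified form, the claim that a small fraction of coordinates limits the feasible step size.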

Paper Structure

This paper contains 25 sections, 2 theorems, 17 equations, 16 figures, 9 tables, 9 algorithms.

Key Result

Theorem 1

Suppose $f$ is non-convex and $L$-smooth, and there exist $0 < \varepsilon < 1$ and $\ell \ll L$ such that for every $x$, after removing an $\varepsilon$-fraction of the coordinates, the remaining Hessian has spectral norm at most $\ell$. Then, in the worst case, if we run SGD clipping with some optim

Figures (16)

  • Figure 1: Histogram of update step distribution over coordinates for SGD, Adam, and Adafactor on machine translation.
  • Figure 2: The loss landscape in different update directions on machine translation in SGD geometry. The step size is the learning rate normalized by the update step $\ell_2$ norm. The plots of clipped and unclipped variants of the same algorithm have the same color with different opacity.
  • Figure 3: The loss landscape in different update directions on machine translation in Adam geometry.
  • Figure 4: The loss landscape in different update directions on autoregressive language modeling in SGD geometry.
  • Figure 5: SGD momentum with clipping
  • ...and 11 more figures

Theorems & Definitions (3)

  • Theorem 1: informal
  • Theorem 2: Gradient descent lemma
  • proof