Adam Converges Without Any Modification On Update Rules

Yushun Zhang, Bingran Li, Congliang Chen, Zhi-Quan Luo, Ruoyu Sun

TL;DR

This work proves that Adam converges with proper problem-dependent hyperparameters, and indicates a phase transition for Adam from divergence to convergence when changing the $(\beta_1, \beta_2)$ combination.

Abstract

Adam is the default algorithm for training neural networks, including large language models (LLMs). However, \citet{reddi2019convergence} provided an example in which Adam diverges, raising concerns about its deployment in AI model training. We identify a key mismatch between the divergence example and practice: \citet{reddi2019convergence} pick the problem after picking the hyperparameters of Adam, i.e., $(β_1,β_2)$, whereas practical applications often fix the problem first and then tune $(β_1,β_2)$. In this work, we prove that Adam converges with proper problem-dependent hyperparameters. First, we prove that Adam converges when $β_2$ is large and $β_1 < \sqrt{β_2}$. Second, when $β_2$ is small, we point out a region of $(β_1,β_2)$ combinations where Adam can diverge to infinity. Our results indicate a phase transition for Adam from divergence to convergence when changing the $(β_1, β_2)$ combination. To our knowledge, this is the first phase transition in the $(β_1,β_2)$ 2D plane reported in the literature, providing rigorous theoretical guarantees for the Adam optimizer. We further point out that the critical boundary $(β_1^*, β_2^*)$ is problem-dependent and, in particular, dependent on batch size. This provides suggestions on how to tune $β_1$ and $β_2$: when Adam does not work well, we suggest increasing $β_2$, inversely with batch size, until it surpasses the threshold $β_2^*$, and then trying $β_1< \sqrt{β_2}$. Our suggestions are supported by reports from several empirical studies, which observe improved LLM training performance when applying them.
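For readers who want the roles of $β_1$ and $β_2$ spelled out, the following is a minimal sketch of one textbook Adam step together with a check of the condition $β_1 < \sqrt{β_2}$ from the abstract. The function names `adam_step` and `satisfies_beta_condition` are illustrative, and the bias-corrected form shown here is an assumption: the variant analyzed in the paper may differ in such details.

```python
import numpy as np

def adam_step(x, grad, m, v, k, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One textbook Adam update; k is the 1-based step index (assumed form)."""
    m = beta1 * m + (1.0 - beta1) * grad        # first-moment (momentum) estimate
    v = beta2 * v + (1.0 - beta2) * grad ** 2   # second-moment estimate
    m_hat = m / (1.0 - beta1 ** k)              # bias correction
    v_hat = v / (1.0 - beta2 ** k)
    x = x - lr * m_hat / (np.sqrt(v_hat) + eps)
    return x, m, v

def satisfies_beta_condition(beta1, beta2):
    """Check the condition beta1 < sqrt(beta2) stated in the abstract."""
    return bool(beta1 < np.sqrt(beta2))
```

The check only covers the $β_1 < \sqrt{β_2}$ part of the guidance. The threshold $β_2^*$ is problem-dependent, so it has to be found by tuning rather than computed from this snippet.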

Paper Structure (77 sections, 21 theorems, 225 equations, 8 figures, 2 algorithms)


Key Result

Theorem 2.3

For any fixed $(\beta_1,\beta_2)$ satisfying $\beta_1 < \sqrt{\beta_2}$, there exists a sufficiently large $n$ such that applying Adam to the function \ref{counterexample_reddi} (under cyclic sampling) converges to the sub-optimal point $x=1$.
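The setting of this theorem can be explored numerically. The sketch below runs projected Adam with cyclic sampling on an assumed form of the counterexample, since the function itself is not reproduced in this summary: $n$ components on $x \in [-1, 1]$, with $f_0(x) = nx$ and $f_i(x) = -x$ otherwise, so that the sum is minimized at $x=-1$. The function name `run_projected_adam`, the omission of bias correction, the stepsize $0.1/\sqrt{k}$, and all default values are illustrative choices, not the paper's exact setup.

```python
import numpy as np

def run_projected_adam(n=20, beta1=0.9, beta2=0.99, lr=0.1,
                       epochs=200, x0=0.0, eps=1e-8):
    """Projected Adam on an ASSUMED counterexample: f_0(x) = n*x, f_i(x) = -x, x in [-1, 1]."""
    x, m, v = x0, 0.0, 0.0
    k = 0
    for _ in range(epochs):
        for i in range(n):                        # cyclic sampling over components
            k += 1
            grad = float(n) if i == 0 else -1.0   # gradient of the sampled component
            m = beta1 * m + (1.0 - beta1) * grad
            v = beta2 * v + (1.0 - beta2) * grad ** 2
            step = lr / np.sqrt(k)                # diminishing stepsize (assumed)
            x = x - step * m / (np.sqrt(v) + eps)
            x = float(np.clip(x, -1.0, 1.0))      # project back onto [-1, 1]
    return x

final_x = run_projected_adam()
```

Varying `n`, `beta1`, and `beta2` and inspecting `final_x` shows whether the iterate ends near $x=-1$ (the minimizer under the assumed form) or near the sub-optimal point $x=1$.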

Figures (8)

  • Figure 1: (a): The divergent region of Adam claimed by reddi2019convergence. They fix $(\beta_1,\beta_2)$ first and then pick a problem to construct the divergence example. (b): An illustration of our contribution in the $(\beta_1,\beta_2)$ phase diagram. We fix the problem before picking $(\beta_1,\beta_2)$. Note that this is a different setting from (a), so there is no contradiction. The boundaries of both the red and blue regions depend on batch size (shown later). The shape of the region follows the solution to our analytic conditions. The dotted curve satisfies $\beta_1 =\sqrt{\beta_2}$. (c), (d): The training loss on MNIST and CIFAR-10. We sweep $\beta_1$ and $\beta_2$ over the grid $\{(k_1/50,k_2/50)| k_1 = 0,\cdots,49,k_2=0, \cdots, 49\}$, resulting in 2,500 trials. The performance of Adam is consistent with our theoretical characterization in (b).
  • Figure 2: (a, b): We restate Figure 1 from reddi2019convergence. The divergence of Adam happens under both cyclic and random update orders, so randomization cannot prevent divergence. Since they consider constrained problems, the term “divergence” here means getting stuck at the sub-optimal solution $x=1$. In the figures, AMSGrad is a different method, and it is not our focus. (c): Diminishing stepsize $\eta_k = \frac{1}{\sqrt{k}}$ does not prevent divergence.
  • Figure 3: (a): The training loss of Adam on MNIST under different batch sizes and values of $\beta_2$. The trend aligns with our theory: we need a larger $\beta_2$ when batch size is small. Here, we use the default $\beta_1=0.9$. (b), (c): Large-$\beta_2$ Adam converges to a neighborhood of critical points when $D_0>0$ and converges to exact critical points when $D_0 = 0$. We use diminishing stepsize $\eta_k = 0.1 /\sqrt{k}$ as in our theory. Experimental details are shown in Appendix \ref{appendix:exp_setting}.
  • Figure 4: On the effect of $\beta_2$ on LLM pre-training from recent literature. (a,b,c): Final validation loss of LLMs trained under different $\beta_2$ and batch sizes. (c): Greener color indicates lower validation loss. (d, e): The optimal $\beta_2$ to train an LLM with 1.2B parameters under different batch sizes. Here, $\tau$ relates to other training tricks that are independent of our discussion. These results reach a consistent conclusion: a larger $\beta_2$ helps boost performance and should be increased in small-batch-size regimes. These results confirm that our theory provides valid guidance for hyperparameter tuning in LLM pre-training.
  • Figure 5: (a): On function \ref{counterexample1} with $n=20$ and $a= 1$, Adam diverges in the colored region. The region is plotted by solving conditions \ref{diverge_c1}, \ref{diverge_c2}, \ref{diverge_c3} in NumPy. The blue curve satisfies $\beta_1 = \sqrt{\beta_2}$. (b): When $\beta_2$ is small, Adam diverges. We use function \ref{counterexample1} with initialization $x =-5$ and $n=20$. The labels in (b) stand for $[\beta_1,\beta_2]$.
  • ...and 3 more figures
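
The hyperparameter sweep described in the Figure 1 caption can be laid out as follows. This is a minimal sketch of the grid construction only; the helper name `beta_grid` is illustrative, and the training run performed at each grid point is not shown.

```python
import numpy as np

def beta_grid(steps=50):
    """Return the (beta1, beta2) grid {(k1/steps, k2/steps)} for k1, k2 = 0..steps-1."""
    ks = np.arange(steps) / steps                    # 0, 1/50, ..., 49/50
    b1, b2 = np.meshgrid(ks, ks, indexing="ij")      # b1 varies along axis 0
    return b1, b2

b1, b2 = beta_grid()
num_trials = b1.size                # 50 * 50 grid points
below_curve = b1 < np.sqrt(b2)      # points satisfying beta1 < sqrt(beta2)
```

The boolean mask `below_curve` marks which trials lie on the $\beta_1 < \sqrt{\beta_2}$ side of the dotted curve in panel (b), which is convenient for overlaying the curve on a heatmap of training losses.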

Theorems & Definitions (27)

  • Theorem 2.3: Theorem 2 in reddi2019convergence
  • Theorem 2.4: Theorem 1 in reddi2019convergence
  • Theorem 3.1
  • Corollary 3.2
  • Theorem 3.3
  • Corollary 3.4
  • Theorem 3.5
  • Lemma 4.1
  • Lemma 4.2
  • Lemma 6.1
  • ...and 17 more