How to Set $β_1, β_2$ in Adam: An Online Learning Perspective

Quan Nguyen

TL;DR

The paper reframes Adam as Follow-the-Regularized-Leader within an online-to-nonconvex setting and derives general discounted regret bounds that extend beyond the traditional $\beta_1=\sqrt{\beta_2}$ regime. It analyzes two regimes, $\beta_1 \leq \sqrt{\beta_2}$ and $\beta_1 \geq \sqrt{\beta_2}$, providing tight, domain-general regret bounds and showing optimality under oblivious adversaries, while also presenting a non-oblivious construction where $\beta_1=\sqrt{\beta_2}$ is suboptimal. The results imply that practical tuning of $\beta_1$ and $\beta_2$ should account for adversary characteristics, and suggest potential benefits from dynamic momentum strategies. An empirical finding corroborates that $\frac{\beta_1}{\sqrt{\beta_2}}$ near 1 often aligns with strong performance across batch sizes. Overall, the work deepens theoretical understanding of Adam’s momentum parameters and their impact on convergence in online-to-nonconvex settings.

Abstract

While Adam is one of the most effective optimizers for training large-scale machine learning models, a theoretical understanding of how to optimally set its momentum factors, $β_1$ and $β_2$, remains largely incomplete. Prior works have shown that Adam can be seen as an instance of Follow-the-Regularized-Leader (FTRL), one of the most important classes of algorithms in online learning. The analyses in these works required setting $β_1 = \sqrt{β_2}$, which does not cover the more practical cases with $β_1 \neq \sqrt{β_2}$. We derive novel, more general analyses that hold for both $β_1 \geq \sqrt{β_2}$ and $β_1 \leq \sqrt{β_2}$. In both cases, our results strictly generalize the existing bounds. Furthermore, we show that our bounds are tight in the worst case. We also prove that setting $β_1 = \sqrt{β_2}$ is optimal for an oblivious adversary, but sub-optimal for a non-oblivious adversary.
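To make the roles of the two momentum factors concrete, here is a minimal sketch of one step of the textbook Adam update on a scalar parameter. This is the commonly stated form of Adam with bias correction, not the FTRL-based formulation analyzed in the paper; the function name `adam_step` and the values of `lr` and `eps` are assumptions chosen for illustration.

```python
import math


def adam_step(x, m, v, g, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One step of textbook Adam on a scalar parameter (illustrative only).

    x: current parameter, m/v: first/second moment estimates,
    g: current gradient, t: step counter starting at 1.
    """
    # beta1 controls the exponential average of gradients (momentum).
    m = beta1 * m + (1.0 - beta1) * g
    # beta2 controls the exponential average of squared gradients.
    v = beta2 * v + (1.0 - beta2) * g * g
    # Bias correction compensates for the zero initialization of m and v.
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    # The update is the corrected momentum scaled by the root second moment.
    x = x - lr * m_hat / (math.sqrt(v_hat) + eps)
    return x, m, v


# Usage: a single step on f(x) = x^2 starting from x = 1 (gradient 2x = 2).
x, m, v = adam_step(1.0, 0.0, 0.0, 2.0, t=1, lr=0.01)
```

On the first step the bias-corrected ratio `m_hat / sqrt(v_hat)` has magnitude close to 1 regardless of the choice of `beta1` and `beta2`; the two factors only begin to interact on later steps, which is where the relationship between $β_1$ and $\sqrt{β_2}$ matters.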

Paper Structure

This paper contains 11 sections, 9 theorems, 38 equations, 1 figure, 2 algorithms.

Key Result

Theorem 1

For any $T \geq 2$ and $\beta_1 \leq \sqrt{\beta_2}$, any non-increasing sequence $(\alpha_t)_t$ (that is, $\alpha_{t+1} \leq \alpha_t$), and any sequence $(g_t)_{t=0,\dots,T}$, algo:AdamfromFTRL guarantees

Figures (1)

  • Figure 1: This is Figure 3 in Orvieto2025SearchAdamSecretSauce, showing the empirical results of tuning $\beta_1, \beta_2$ across three batch sizes for training 160M-parameter transformers. Yellow indicates the best performance, while dark blue indicates sub-optimal performance. Among the yellow boxes, the smallest $\frac{\beta_1}{\sqrt{\beta_2}}$ ratio is approximately $1$, attained at batch size $256$ with $\beta_1 = 0.9$ and $\beta_2 = 0.8$.
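
The ratio quoted in the caption can be checked with a few lines of arithmetic. The sketch below computes $\frac{\beta_1}{\sqrt{\beta_2}}$ and reports which of the two regimes from the abstract a setting falls into. The helper names `momentum_ratio` and `regime` and the tolerance `tol` are hypothetical, and the pair $(0.9, 0.999)$ is included only as a widely used library default, not as a value taken from the paper.

```python
import math


def momentum_ratio(beta1, beta2):
    """Return beta1 / sqrt(beta2), the ratio discussed in the caption."""
    return beta1 / math.sqrt(beta2)


def regime(beta1, beta2, tol=1e-12):
    """Classify a (beta1, beta2) pair relative to beta1 = sqrt(beta2)."""
    r = momentum_ratio(beta1, beta2)
    if abs(r - 1.0) <= tol:
        return "beta1 == sqrt(beta2)"
    return "beta1 < sqrt(beta2)" if r < 1.0 else "beta1 > sqrt(beta2)"


# The setting named in the caption: ratio is roughly 1.006.
print(momentum_ratio(0.9, 0.8), regime(0.9, 0.8))
# A common library default (assumption, not from the paper): ratio is roughly 0.900.
print(momentum_ratio(0.9, 0.999), regime(0.9, 0.999))
```

The caption's setting sits just above the boundary $\beta_1 = \sqrt{\beta_2}$, while the common default sits below it, so the two fall into different regimes of the analysis.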

Theorems & Definitions (14)

  • Theorem 1
  • Corollary 2
  • Remark 3
  • Remark 4
  • Theorem 5
  • Lemma 6
  • Remark 7
  • Remark 8
  • Theorem 9
  • Remark 10
  • ...and 4 more