How to Set $β_1, β_2$ in Adam: An Online Learning Perspective

Quan Nguyen

TL;DR

The paper treats Adam as an instance of Follow-the-Regularized-Leader (FTRL) within an online-to-nonconvex setting and derives general discounted regret bounds that extend beyond the traditional $\beta_1=\sqrt{\beta_2}$ regime. It analyzes two regimes, $\beta_1 \leq \sqrt{\beta_2}$ and $\beta_1 \geq \sqrt{\beta_2}$, providing tight, domain-general regret bounds in each. It shows that $\beta_1=\sqrt{\beta_2}$ is optimal against an oblivious adversary and presents a non-oblivious construction where $\beta_1=\sqrt{\beta_2}$ is suboptimal. The results imply that practical tuning of $\beta_1$ and $\beta_2$ should account for adversary characteristics, and suggest potential benefits from dynamic momentum strategies. Empirically, a ratio $\frac{\beta_1}{\sqrt{\beta_2}}$ near 1 often coincides with strong performance across batch sizes. Overall, the work deepens theoretical understanding of Adam’s momentum parameters and their impact on convergence in online-to-nonconvex settings.

Abstract

While Adam is one of the most effective optimizers for training large-scale machine learning models, a theoretical understanding of how to optimally set its momentum factors, $β_1$ and $β_2$, remains largely incomplete. Prior works have shown that Adam can be seen as an instance of Follow-the-Regularized-Leader (FTRL), one of the most important classes of algorithms in online learning. The analyses in these works required setting $β_1 = \sqrt{β_2}$, which does not cover the more practical cases with $β_1 \neq \sqrt{β_2}$. We derive novel, more general analyses that hold for both $β_1 \geq \sqrt{β_2}$ and $β_1 \leq \sqrt{β_2}$. In both cases, our results strictly generalize the existing bounds. Furthermore, we show that our bounds are tight in the worst case. We also prove that setting $β_1 = \sqrt{β_2}$ is optimal for an oblivious adversary, but sub-optimal for a non-oblivious adversary.
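
To make the roles of the two momentum factors concrete, the following is a minimal sketch of one scalar Adam step in its commonly used form with bias correction, together with a helper that reports which of the two regimes from the abstract a given $(β_1, β_2)$ pair falls into. It is an illustration under stated assumptions, not the FTRL formulation analyzed in the paper: the function names `adam_step`, `momentum_ratio`, and `regime`, the tolerance `tol`, and the default hyperparameters are all choices made for this example.

```python
import math


def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One scalar Adam update in the commonly used bias-corrected form.

    This is an illustrative sketch, not the paper's FTRL view of Adam.
    `t` is the 1-based step count.
    """
    # beta1 controls the exponential moving average of gradients.
    m = beta1 * m + (1.0 - beta1) * grad
    # beta2 controls the exponential moving average of squared gradients.
    v = beta2 * v + (1.0 - beta2) * grad * grad
    # Bias correction compensates for initializing m and v at zero.
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


def momentum_ratio(beta1, beta2):
    """Return beta1 / sqrt(beta2), the ratio highlighted in the TL;DR."""
    return beta1 / math.sqrt(beta2)


def regime(beta1, beta2, tol=1e-12):
    """Classify (beta1, beta2) relative to beta1 = sqrt(beta2).

    `tol` is a hypothetical numerical tolerance for the equality case.
    """
    r = momentum_ratio(beta1, beta2)
    if abs(r - 1.0) <= tol:
        return "beta1 == sqrt(beta2)"
    return "beta1 < sqrt(beta2)" if r < 1.0 else "beta1 > sqrt(beta2)"


if __name__ == "__main__":
    p, m, v = adam_step(param=1.0, grad=2.0, m=0.0, v=0.0, t=1)
    print(p, regime(0.9, 0.999), momentum_ratio(0.9, 0.999))
```

With the widely used defaults $β_1 = 0.9$ and $β_2 = 0.999$, the ratio is about 0.90, so that setting lies in the $β_1 \leq \sqrt{β_2}$ regime rather than on the $β_1 = \sqrt{β_2}$ boundary required by the earlier analyses.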


