How to Set $β_1, β_2$ in Adam: An Online Learning Perspective
Quan Nguyen
TL;DR
The paper reframes Adam as Follow-the-Regularized-Leader within an online-to-nonconvex setting and derives general discounted regret bounds that extend beyond the traditional $\beta_1=\sqrt{\beta_2}$ regime. It analyzes two regimes, $\beta_1 \leq \sqrt{\beta_2}$ and $\beta_1 \geq \sqrt{\beta_2}$, providing tight, domain-general regret bounds and showing that $\beta_1=\sqrt{\beta_2}$ is optimal against an oblivious adversary, while also constructing a non-oblivious adversary against which $\beta_1=\sqrt{\beta_2}$ is suboptimal. The results imply that practical tuning of $\beta_1$ and $\beta_2$ should account for the type of adversary, and they suggest potential benefits from dynamic momentum strategies. An empirical finding corroborates that a ratio $\frac{\beta_1}{\sqrt{\beta_2}}$ near 1 often coincides with strong performance across batch sizes. Overall, the work deepens the theoretical understanding of Adam’s momentum parameters and their impact on convergence in online-to-nonconvex settings.
Abstract
While Adam is one of the most effective optimizers for training large-scale machine learning models, a theoretical understanding of how to optimally set its momentum factors, $β_1$ and $β_2$, remains largely incomplete. Prior works have shown that Adam can be seen as an instance of Follow-the-Regularized-Leader (FTRL), one of the most important classes of algorithms in online learning. The analyses in these works required setting $β_1 = \sqrt{β_2}$, which does not cover the more practical cases with $β_1 \neq \sqrt{β_2}$. We derive novel, more general analyses that hold for both $β_1 \geq \sqrt{β_2}$ and $β_1 \leq \sqrt{β_2}$. In both cases, our results strictly generalize the existing bounds. Furthermore, we show that our bounds are tight in the worst case. We also prove that setting $β_1 = \sqrt{β_2}$ is optimal for an oblivious adversary, but sub-optimal for a non-oblivious adversary.
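To make the role of the two momentum factors concrete, the following is a minimal Python sketch of the standard textbook Adam update for a single scalar parameter, together with a small helper that reports which side of $β_1 = \sqrt{β_2}$ a given setting falls on. It is an illustration only: it is not the FTRL formulation analyzed in the paper, and the names `adam_step` and `momentum_regime`, the default hyperparameters, and the tolerance `tol` are assumptions introduced here.

```python
import math


def adam_step(x, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One scalar Adam step in the standard bias-corrected form.

    x: parameter, g: gradient, m/v: running moments, t: 1-based step count.
    """
    m = beta1 * m + (1.0 - beta1) * g        # first-moment moving average
    v = beta2 * v + (1.0 - beta2) * g * g    # second-moment moving average
    m_hat = m / (1.0 - beta1 ** t)           # bias correction for m
    v_hat = v / (1.0 - beta2 ** t)           # bias correction for v
    x = x - lr * m_hat / (math.sqrt(v_hat) + eps)
    return x, m, v


def momentum_regime(beta1, beta2, tol=1e-12):
    """Return (beta1 / sqrt(beta2), label); hypothetical helper for illustration."""
    ratio = beta1 / math.sqrt(beta2)
    if abs(ratio - 1.0) <= tol:
        label = "beta1 == sqrt(beta2)"
    elif ratio < 1.0:
        label = "beta1 < sqrt(beta2)"
    else:
        label = "beta1 > sqrt(beta2)"
    return ratio, label


if __name__ == "__main__":
    print(momentum_regime(0.9, 0.999))
    print(adam_step(x=1.0, g=2.0, m=0.0, v=0.0, t=1))
```

With the commonly used values $β_1 = 0.9$ and $β_2 = 0.999$, the ratio $β_1 / \sqrt{β_2}$ is roughly 0.9005, so this setting falls in the $β_1 \leq \sqrt{β_2}$ regime rather than on the boundary $β_1 = \sqrt{β_2}$ required by the earlier analyses.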
