Understanding Adam Optimizer via Online Learning of Updates: Adam is FTRL in Disguise

Kwangjun Ahn; Zhiyu Zhang; Yunbum Kook; Yan Dai

TL;DR

This work reframes the Adam optimizer through online learning of updates (OLU), showing that Adam corresponds to a discounted variant of Follow-the-Regularized-Leader (Discounted-FTRL). By linking the dynamic regret guarantees of Discounted-FTRL to optimization performance, the authors explain how momentum (updates aggregated over time) and discounting (β<1) contribute to Adam's effectiveness, especially in dynamic or nonstationary environments. The analysis yields dynamic regret bounds and a discounted-to-dynamic conversion, giving theoretical insight into when Adam-like dynamics outperform non-momentum baselines. The results suggest that Adam could be effective for sparse and nonstationary gradients, and they point to a broader framework for analyzing and designing optimizers beyond Adam.

Abstract

Despite the success of the Adam optimizer in practice, the theoretical understanding of its algorithmic components remains limited. In particular, most existing analyses of Adam establish convergence rates that non-adaptive algorithms like SGD can also achieve. In this work, we provide a different perspective based on online learning that underscores the importance of Adam's algorithmic components. Inspired by Cutkosky et al. (2023), we consider the framework called online learning of updates/increments, where we choose the updates/increments of an optimizer based on an online learner. With this framework, the design of a good optimizer is reduced to the design of a good online learner. Our main observation is that Adam corresponds to a principled online learning framework called Follow-the-Regularized-Leader (FTRL). Building on this observation, we study the benefits of its algorithmic components from the online learning perspective.
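As a rough numerical illustration of the correspondence between Adam and FTRL, the following Python sketch compares a simplified Adam-style increment with a discounted FTRL increment in one dimension. Several assumptions here are not taken from the text above: Adam is run without the epsilon term or bias correction, its moments start at zero, and the second-moment decay is set to the square of the first-moment decay. The FTRL side is assumed to use a quadratic regularizer with a scale-free step size on gradients discounted by β. The function names and the rescaled step size are hypothetical choices for this sketch.

```python
import math


def adam_increments(grads, alpha, beta):
    """Simplified 1D Adam increments (assumed: beta1 = beta, beta2 = beta**2,
    no epsilon, no bias correction, zero-initialized moments)."""
    m, v = 0.0, 0.0
    out = []
    for g in grads:
        m = beta * m + (1 - beta) * g             # first-moment EMA
        v = beta**2 * v + (1 - beta**2) * g * g   # second-moment EMA
        out.append(-alpha * m / math.sqrt(v))
    return out


def discounted_ftrl_increments(grads, alpha, beta):
    """Closed-form increment of FTRL on beta-discounted gradients with a
    scale-free step size: -alpha * sum(w_s g_s) / sqrt(sum((w_s g_s)**2)),
    where w_s = beta**(t - s)."""
    out = []
    for t in range(len(grads)):
        weighted = [beta ** (t - s) * grads[s] for s in range(t + 1)]
        num = sum(weighted)
        den = math.sqrt(sum(w * w for w in weighted))
        out.append(-alpha * num / den)
    return out


# Arbitrary illustrative gradients; the first one is nonzero so v > 0.
grads = [0.5, -1.0, 2.0, 0.3, -0.7]
alpha, beta = 0.1, 0.9

# The EMA normalization constants fold into a rescaled step size.
alpha_ftrl = alpha * (1 - beta) / math.sqrt(1 - beta**2)

adam = adam_increments(grads, alpha, beta)
ftrl = discounted_ftrl_increments(grads, alpha_ftrl, beta)
max_gap = max(abs(a - f) for a, f in zip(adam, ftrl))
```

Under these assumptions the two sequences of increments agree up to floating-point error, so `max_gap` is tiny. The only difference between the two forms is the constant factor `(1 - beta) / sqrt(1 - beta**2)`, which is absorbed into the step size.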

Paper Structure (36 sections, 16 theorems, 76 equations, 2 figures)

Introduction
Our Approach and Main Results
Adam is FTRL in Disguise
Choosing Updates/Increments via Online Learning
Basics of Follow-the-Regularized-Leader (FTRL)
Adam Corresponds to Discounted-FTRL
Comparison with the Previous Approach
Discounted-FTRL as a Dynamic Learner
Basics of Dynamic Online Learning
Benefits of Momentum and Discounting Factor
Proof Sketch of Theorem 3.1
Implications for Optimization
Revisiting lower bound example
Adam Could Be Effective for Sparse and Nonstationary Gradients
Conclusion and Discussion
...and 21 more sections

Key Result

Theorem 2.1

In OLU, a lower dynamic regret for the Learner leads to a better optimization guarantee. Therefore, we want the Learner to have low dynamic regret.
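To make the quantities in this statement concrete, here is a minimal Python sketch of an update-learning loop together with the dynamic regret of the learner on linear losses. The protocol is a simplified assumption rather than the paper's exact definition: the learner proposes an increment, the iterate moves by that increment, and the gradient at the new iterate defines the linear loss for that round. The names `run_olu`, `learner_step`, and `dynamic_regret`, the quadratic objective, and the toy learner are all hypothetical.

```python
def run_olu(grad_fn, learner_step, x0, steps):
    """Run a simplified 1D update-learning loop.

    learner_step(g, state) -> (next_delta, state) is a hypothetical interface
    for an online learner that sees the latest gradient g.
    """
    x, state = x0, None
    delta = 0.0  # assumed: the learner's first increment is zero
    deltas, grads = [], []
    for _ in range(steps):
        x = x + delta            # the optimizer moves by the learner's increment
        g = grad_fn(x)           # gradient revealed at the new iterate
        deltas.append(delta)     # the round's linear loss is g * delta
        grads.append(g)
        delta, state = learner_step(g, state)
    return x, deltas, grads


def dynamic_regret(grads, deltas, comparators):
    """Regret on linear losses against a time-varying comparator sequence."""
    return sum(g * (d - u) for g, d, u in zip(grads, deltas, comparators))


# Toy setup: f(x) = 0.5 * x**2, so the gradient is x.
# The toy learner simply plays -0.5 times the last gradient.
x_final, deltas, grads = run_olu(
    grad_fn=lambda x: x,
    learner_step=lambda g, state: (-0.5 * g, state),
    x0=4.0,
    steps=3,
)
regret_vs_zero = dynamic_regret(grads, deltas, [0.0, 0.0, 0.0])
```

In this toy run the iterates are 4, 2, 1, and the regret against the all-zero comparator sequence is -5. Being able to pick a different comparator at each round is what distinguishes dynamic regret from static regret.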

Figures (2)

Figure 1: 1D illustration of the regularized hinge loss $\ell(x) = \max(0,1-x) + \lambda |x|$. We illustrate the case $\lambda=1/4$.
Figure 2: Experimental results for the hinge loss classification. (Left) The case of ${\mathbf z}_i = {\mathbf e}_i$. (Right) The case of ${\mathbf z}_i = c_i{\mathbf e}_i$, where $c_i \sim \text{Unif}[0,2]$. The horizontal dotted line indicates the optimum value of $F$. All experiments are run for five different random seeds, and we plot the error shades (they are quite small and not conspicuous).

Theorems & Definitions (19)

Theorem 2.1: Importance of dynamic regret in OLU; see Theorem 4.1
Theorem 2.2: Informal; see Theorems 3.1 and 3.2
Proposition 2.3: Adam is discounted-FTRL in disguise
Theorem 3.1: Dynamic regret of Discounted-FTRL; unbounded domain
Theorem 3.2: Dynamic regret of clipped Discounted-FTRL; bounded domain
Theorem 3.3: Lower bounds for baselines
Corollary 3.4
Corollary 3.5
Definition 3.6: $\beta$-discounted regret
Theorem 4.1: Importance of dynamic regret in OLU
...and 9 more
