Understanding Adam Optimizer via Online Learning of Updates: Adam is FTRL in Disguise

Kwangjun Ahn, Zhiyu Zhang, Yunbum Kook, Yan Dai

TL;DR

This work reframes the Adam optimizer through online learning of updates (OLU), showing that Adam corresponds to a discounted instance of Follow-the-Regularized-Leader (Discounted-FTRL). By linking the dynamic regret guarantees of Discounted-FTRL to optimization performance, the authors show how momentum (updates aggregated over time) and discounting (β<1) contribute to Adam's effectiveness, especially in dynamic or nonstationary environments. The analysis yields dynamic regret bounds and a discounted-to-dynamic conversion, giving theoretical insight into when Adam-like dynamics outperform non-momentum baselines. The results point to Adam's potential advantages for sparse and changing gradients, and suggest a broader framework for analyzing and designing optimizers beyond Adam.

Abstract

Despite the success of the Adam optimizer in practice, the theoretical understanding of its algorithmic components remains limited. In particular, most existing analyses of Adam establish a convergence rate that can be achieved just as well by non-adaptive algorithms like SGD. In this work, we provide a different perspective based on online learning that underscores the importance of Adam's algorithmic components. Inspired by Cutkosky et al. (2023), we consider the framework called online learning of updates/increments, where the updates/increments of an optimizer are chosen by an online learner. Within this framework, designing a good optimizer reduces to designing a good online learner. Our main observation is that Adam corresponds to a principled online learning framework called Follow-the-Regularized-Leader (FTRL). Building on this observation, we study the benefits of its algorithmic components from the online learning perspective.
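
The reduction described in the abstract can be sketched as a short loop: the optimizer asks an online learner for the next increment, applies it, and feeds the resulting gradient back as a linear loss. The names below (`Learner`, `DiscountedGradientLearner`, `run_olu`) and the toy quadratic objective are illustrative assumptions, not taken from the paper, and the learner shown is a deliberately simple stand-in rather than the FTRL learner the paper analyzes.

```python
from typing import Callable, Protocol


class Learner(Protocol):
    """Minimal online-learner interface (hypothetical, for illustration)."""

    def predict(self) -> float: ...
    def update(self, grad: float) -> None: ...


class DiscountedGradientLearner:
    """Toy stand-in learner: delta <- beta * delta - eta * grad.

    This is NOT the paper's learner; it only shows where a learner plugs in.
    """

    def __init__(self, eta: float, beta: float) -> None:
        self.eta = eta
        self.beta = beta
        self.delta = 0.0

    def predict(self) -> float:
        return self.delta

    def update(self, grad: float) -> None:
        # Discount the previous increment, then step against the gradient.
        self.delta = self.beta * self.delta - self.eta * grad


def run_olu(grad_fn: Callable[[float], float], learner: Learner,
            x0: float, steps: int) -> float:
    """Online learning of updates: x_t = x_{t-1} + delta_t, delta_t from the learner."""
    x = x0
    for _ in range(steps):
        x = x + learner.predict()   # apply the increment chosen by the learner
        learner.update(grad_fn(x))  # reveal the gradient as the learner's loss
    return x


# Toy objective F(x) = x^2 / 2, whose gradient is x (assumed for the demo).
x_final = run_olu(lambda x: x, DiscountedGradientLearner(eta=0.1, beta=0.9),
                  x0=5.0, steps=200)
print(abs(x_final))
```

Swapping in a different `Learner` changes the optimizer without touching the loop, which is the sense in which optimizer design reduces to learner design.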

Paper Structure (36 sections, 16 theorems, 76 equations, 2 figures)

Key Result

Theorem 2.1

In online learning of updates (OLU), a lower dynamic regret for the online learner leads to a better optimization guarantee. The learner should therefore be designed to have low dynamic regret.
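
As a rough sketch of why this holds, assume the iterates are $x_t = x_{t-1} + \Delta_t$ and that $g_t$ is a (stochastic) gradient of $F$ taken along the segment from $x_{t-1}$ to $x_t$; the notation here is assumed for illustration and may differ from the paper's. The dynamic regret of the learner against a comparator sequence $u_1, \dots, u_T$ is then the standard quantity

$$R_T(u_{1:T}) = \sum_{t=1}^{T} \langle g_t, \Delta_t - u_t \rangle,$$

and, to first order,

$$F(x_T) - F(x_0) \approx \sum_{t=1}^{T} \langle g_t, \Delta_t \rangle = \sum_{t=1}^{T} \langle g_t, u_t \rangle + R_T(u_{1:T}).$$

Choosing each $u_t$ to point against $g_t$ makes the first sum strongly negative, so any decrease in $R_T(u_{1:T})$ translates directly into a larger decrease of the objective.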

Figures (2)

  • Figure 1: 1D illustration of the regularized hinge loss $\ell(x) = \max(0,1-x) + \lambda |x|$. We illustrate the case $\lambda=1/4$.
  • Figure 2: Experimental results for hinge-loss classification. (Left) The case ${\mathbf z}_i = {\mathbf e}_i$. (Right) The case ${\mathbf z}_i = c_i{\mathbf e}_i$, where $c_i\sim \text{Unif}[0,2]$. The horizontal dotted line indicates the optimal value of $F$. All experiments are run with five different random seeds; shaded error bands are plotted but are too narrow to be conspicuous.
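
The 1D loss from Figure 1 is simple enough to evaluate directly. The snippet below is a small illustration, with the function name `regularized_hinge` and the search grid chosen here as assumptions; it locates the minimizer on that grid for $\lambda = 1/4$.

```python
def regularized_hinge(x: float, lam: float = 0.25) -> float:
    """l(x) = max(0, 1 - x) + lam * |x|, the loss shown in Figure 1."""
    return max(0.0, 1.0 - x) + lam * abs(x)


# Coarse grid on [-2, 3] with step 0.01 (grid choice is an assumption).
grid = [i / 100 for i in range(-200, 301)]
x_star = min(grid, key=regularized_hinge)

print(x_star, regularized_hinge(x_star))
```

For $0 < \lambda < 1$ the loss decreases with slope $-(1-\lambda)$ on $[0, 1]$ and increases with slope $\lambda$ beyond $x = 1$, so the kink at $x = 1$ is the minimizer.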

Theorems & Definitions (19)

  • Theorem 2.1: Importance of dynamic regret in OLU; see Theorem 4.1
  • Theorem 2.2: Informal; see Theorems 3.1 and 3.2
  • Proposition 2.3: Adam is discounted-FTRL in disguise
  • Theorem 3.1: Dynamic regret of Discounted-FTRL; unbounded domain
  • Theorem 3.2: Dynamic regret of Discounted-FTRL with clipping; bounded domain
  • Theorem 3.3: Lower bounds for baselines
  • Corollary 3.4
  • Corollary 3.5
  • Definition 3.6: $\beta$-discounted regret
  • Theorem 4.1: Importance of dynamic regret in OLU
  • ...and 9 more
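
The correspondence named in Proposition 2.3 can be checked numerically in one dimension. The sketch below rests on several assumptions that are not confirmed by the text above: Adam is run without bias correction and with $\epsilon = 0$, its parameters are tied as $\beta_1 = \beta$ and $\beta_2 = \beta^2$, and the discounted-FTRL update is taken to be the closed form $-\alpha \sum_s \beta^{t-s} g_s / \sqrt{\sum_s \beta^{2(t-s)} g_s^2}$ with $\alpha = \mathrm{lr}\,(1-\beta)/\sqrt{1-\beta^2}$. The function names are hypothetical.

```python
import math
import random


def adam_updates(grads, lr, beta1, beta2):
    """Adam increments without bias correction and with epsilon = 0 (assumed)."""
    m = v = 0.0
    out = []
    for g in grads:
        m = beta1 * m + (1.0 - beta1) * g        # first-moment average
        v = beta2 * v + (1.0 - beta2) * g * g    # second-moment average
        out.append(-lr * m / math.sqrt(v))
    return out


def discounted_ftrl_updates(grads, alpha, beta):
    """Assumed closed form of scale-free FTRL on beta-discounted linear losses."""
    out = []
    for t in range(1, len(grads) + 1):
        num = sum(beta ** (t - s) * grads[s - 1] for s in range(1, t + 1))
        den = math.sqrt(sum(beta ** (2 * (t - s)) * grads[s - 1] ** 2
                            for s in range(1, t + 1)))
        out.append(-alpha * num / den)
    return out


random.seed(0)
grads = [random.gauss(0.0, 1.0) for _ in range(50)]  # synthetic gradients

beta, lr = 0.9, 1e-3
alpha = lr * (1.0 - beta) / math.sqrt(1.0 - beta ** 2)

adam = adam_updates(grads, lr, beta1=beta, beta2=beta ** 2)
ftrl = discounted_ftrl_updates(grads, alpha, beta)

max_gap = max(abs(a - b) for a, b in zip(adam, ftrl))
print(max_gap)
```

Under these assumptions the two sequences agree up to floating-point error, because the constant factors $(1-\beta_1)$ and $\sqrt{1-\beta_2}$ in Adam's moving averages are absorbed into the step size $\alpha$.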