Understanding Adam Optimizer via Online Learning of Updates: Adam is FTRL in Disguise
Kwangjun Ahn, Zhiyu Zhang, Yunbum Kook, Yan Dai
TL;DR
This work reframes the Adam optimizer through online learning of updates (OLU), showing that Adam corresponds to a discounted variant of Follow-the-Regularized-Leader (FTRL). By linking the dynamic regret guarantees of discounted FTRL to optimization performance, the authors show how momentum (updates aggregated over time) and discounting (β<1) contribute to Adam's effectiveness, especially in dynamic or nonstationary environments. The analysis yields dynamic regret bounds and a discounted-to-dynamic conversion, giving theoretical insight into when Adam-like dynamics outperform non-momentum baselines. The results point to Adam's potential advantages for sparse and changing gradients, and suggest a broader framework for analyzing and designing optimizers beyond Adam.
Abstract
Despite the success of the Adam optimizer in practice, the theoretical understanding of its algorithmic components remains limited. In particular, most existing analyses of Adam establish convergence rates that can already be achieved by non-adaptive algorithms like SGD. In this work, we provide a different perspective based on online learning that underscores the importance of Adam's algorithmic components. Inspired by Cutkosky et al. (2023), we consider the framework called online learning of updates/increments, where we choose the updates/increments of an optimizer based on an online learner. With this framework, the design of a good optimizer is reduced to the design of a good online learner. Our main observation is that Adam corresponds to a principled online learning framework called Follow-the-Regularized-Leader (FTRL). Building on this observation, we study the benefits of its algorithmic components from the online learning perspective.
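To make the correspondence concrete, the following is a minimal per-coordinate sketch, not the authors' implementation. It assumes Adam without bias correction and without the epsilon term, and it writes the online learner as a ratio of discounted gradient sums; the exact FTRL variant analyzed in the paper may differ in details such as the regularizer. The function names, the `scale` parameter, and the sample gradients are hypothetical and chosen only for illustration.

```python
import math


def adam_increments(grads, lr=0.1, beta1=0.9, beta2=0.999):
    """Adam increments for one coordinate (assumption: no bias correction, eps = 0)."""
    m, v, out = 0.0, 0.0, []
    for g in grads:
        m = beta1 * m + (1 - beta1) * g      # exponential moving average of gradients
        v = beta2 * v + (1 - beta2) * g * g  # exponential moving average of squared gradients
        out.append(-lr * m / math.sqrt(v))   # requires a nonzero gradient so far
    return out


def discounted_sum_increments(grads, scale, beta1=0.9, beta2=0.999):
    """Increments chosen from discounted sums of all past gradients."""
    s1, s2, out = 0.0, 0.0, []
    for g in grads:
        s1 = beta1 * s1 + g      # sum over s <= t of beta1^(t-s) * g_s
        s2 = beta2 * s2 + g * g  # sum over s <= t of beta2^(t-s) * g_s^2
        out.append(-scale * s1 / math.sqrt(s2))
    return out


# Hypothetical gradient sequence for a single coordinate.
grads = [0.5, -1.0, 0.25, 2.0]
lr, beta1, beta2 = 0.1, 0.9, 0.999

# The moving averages equal the discounted sums times (1 - beta1) and (1 - beta2),
# so the two rules differ only by this constant factor.
scale = lr * (1 - beta1) / math.sqrt(1 - beta2)

adam = adam_increments(grads, lr, beta1, beta2)
ftrl_like = discounted_sum_increments(grads, scale, beta1, beta2)

# In the online-learning-of-updates view, each increment is added to the iterate.
w = 1.0
for delta in adam:
    w = w + delta
print(w)
```

Under these assumptions the two increment sequences agree up to floating-point error, which is the sense in which momentum and the adaptive denominator can both be read as discounted aggregates of past gradients.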
