Dynamic Regret via Discounted-to-Dynamic Reduction with Applications to Curved Losses and Adam Optimizer
Yan-Feng Xie, Yu-Jie Zhang, Peng Zhao, Zhi-Hua Zhou
TL;DR
This paper introduces a modular discounted-to-dynamic reduction for online learning with curved losses, which converts discounted-regret guarantees into dynamic-regret guarantees. Two curved losses, online linear regression and online logistic regression, admit sharp dynamic-regret bounds under this framework; for logistic regression, a two-layer ensemble is used to tune the discount factor. Extending the reduction to the Adam optimizer via the O2NC (online-to-non-convex conversion) framework yields optimal convergence rates in stochastic, non-convex, and non-smooth settings, with flexible choices of $(\beta_1,\beta_2)$ for both clipped and clip-free variants. The results clarify the role of momentum and second-moment dynamics in non-stationary environments and provide a unified approach to analyzing adaptive optimizers through online-to-non-convex conversion. Collectively, the work advances theory and practice for dynamic adaptation in curved-loss online learning and non-convex stochastic optimization.
Abstract
We study dynamic regret minimization in non-stationary online learning, with a primary focus on follow-the-regularized-leader (FTRL) methods. FTRL is important for curved losses and for understanding adaptive optimizers such as Adam, yet dynamic regret analyses of FTRL remain comparatively unexplored. To address this, we build on the discounted-to-dynamic reduction and present a modular way to obtain dynamic regret bounds for FTRL-related problems. Specifically, we focus on two representative curved losses: linear regression and logistic regression. Our method not only simplifies existing proofs of the optimal dynamic regret for online linear regression, but also yields new dynamic regret guarantees for online logistic regression. Beyond online convex optimization, we apply the reduction to analyze the Adam optimizer, obtaining optimal convergence rates in stochastic, non-convex, and non-smooth settings. The reduction also enables a more detailed treatment of Adam with two discount parameters $(\beta_1,\beta_2)$, leading to new results for both clipped and clip-free variants of Adam.
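For readers unfamiliar with the two regret notions that the reduction connects, the standard definitions can be sketched as follows. The notation is an assumption made here for illustration and may differ from the paper's: $f_t$ is the loss at round $t$, $x_t$ the learner's decision, $u_1,\dots,u_T$ a time-varying comparator sequence, $u$ a fixed comparator, and $\lambda \in (0,1]$ a discount factor.

$$\text{D-Reg}_T(u_1,\dots,u_T) = \sum_{t=1}^{T} f_t(x_t) - \sum_{t=1}^{T} f_t(u_t)$$

$$\text{Reg}_T^{\lambda}(u) = \sum_{t=1}^{T} \lambda^{T-t}\bigl(f_t(x_t) - f_t(u)\bigr)$$

Setting $u_t = u$ for all $t$ in the first quantity, or $\lambda = 1$ in the second, recovers the usual static regret. A smaller $\lambda$ down-weights older rounds, which is what allows a discounted-regret guarantee to track a changing comparator.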
