Unified Convergence Analysis for Adaptive Optimization with Moving Average Estimator
Zhishuai Guo, Yi Xu, Wotao Yin, Rong Jin, Tianbao Yang
TL;DR
This work develops a unified convergence framework for adaptive optimization using a moving-average gradient estimator (SEMA) and bounded step-size scaling. It shows that an increasing or sufficiently large momentum parameter $\beta_t$ (equivalently, a decreasing or small $\gamma_t = 1-\beta_t$) yields convergence for Adam-family methods in non-convex minimization. The estimator has a variance-reduction property that gives $\mathbb{E}\left[\frac{1}{T+1}\sum_{t=0}^T \|\nabla F(x_t)\|^2\right] = O(1/\sqrt{T})$ and an oracle complexity of $O(1/\epsilon^4)$, while under the PL condition a double-loop variant attains the faster rate $\tilde{O}(\frac{1}{\mu^2 \epsilon})$. Extending to non-convex strongly-concave min-max problems, the paper introduces primal-dual stochastic momentum (PDSM) and adaptive (PDAda) methods that achieve an $O(\kappa^{3/2}/\sqrt{T})$ rate in the primal and $O(\kappa^{3}/\epsilon^4)$ sample complexity with a single loop and $O(1)$ minibatch size, outperforming several baselines. The framework also supports non-convex bilevel optimization, with SMB and SBMA variants offering comparable convergence under weaker assumptions and without large minibatches or double loops. Empirical results on standard vision and molecule datasets corroborate the theory, showing faster and more stable convergence with moving-average estimators and adaptive momentum.
Abstract
Although adaptive optimization algorithms have been successful in many applications, some aspects of their convergence analysis remain poorly understood. This paper provides a novel non-convex analysis of adaptive optimization to address some of these open questions. Our contributions are three-fold. First, we show that an increasing or large enough momentum parameter for the first-order moment, as used in practice, is sufficient to ensure the convergence of adaptive algorithms whose adaptive scaling factors of the step size are bounded. Second, our analysis gives insights for practical implementations, e.g., increasing the momentum parameter in a stage-wise manner, in step with a stage-wise decreasing step size, helps improve convergence. Third, the modular nature of our analysis allows its extension to other optimization problems, e.g., compositional, min-max and bilevel problems. As an interesting yet non-trivial use case, we present algorithms for solving non-convex min-max optimization and bilevel optimization that do not require large batches of data to estimate gradients or double loops, as existing methods in the literature do. Our empirical studies corroborate our theoretical results.
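
To make the first two contributions concrete, the following is a minimal sketch of a moving-average gradient estimator combined with a bounded adaptive step-size scaling, followed by a stage-wise schedule that decreases the step size while increasing the momentum parameter. It is an illustration under stated assumptions, not the paper's algorithm: the function names, the Adam-style second-moment average, the offset `g0`, the noisy quadratic test objective, and all numerical values are hypothetical choices made for this example.

```python
import math
import random


def make_noisy_quadratic_grad(noise_std=0.1, seed=0):
    """Stochastic gradient oracle for F(x) = 0.5 * ||x||^2 (a toy objective)."""
    rng = random.Random(seed)

    def grad(x):
        # True gradient is x; add independent Gaussian noise per coordinate.
        return [xi + rng.gauss(0.0, noise_std) for xi in x]

    return grad


def sema_adaptive_descent(grad_oracle, x0, num_steps, eta=0.05, beta=0.9,
                          beta2=0.999, g0=1.0):
    """Descent with a moving-average gradient estimate and bounded scaling."""
    x = list(x0)
    v = [0.0] * len(x)  # moving average of stochastic gradients (first moment)
    u = [0.0] * len(x)  # moving average of squared gradients (second moment)
    for _ in range(num_steps):
        g = grad_oracle(x)
        for i in range(len(x)):
            v[i] = beta * v[i] + (1.0 - beta) * g[i]
            u[i] = beta2 * u[i] + (1.0 - beta2) * g[i] * g[i]
            # The offset g0 > 0 keeps the scaling at most 1 / g0; it is also
            # bounded away from zero whenever the gradients are bounded.
            s = 1.0 / (math.sqrt(u[i]) + g0)
            x[i] -= eta * s * v[i]
    return x


def run_stagewise(grad_oracle, x0, stages):
    """Run several stages, each given as (eta, beta, num_steps).

    Simplification for this sketch: the moving averages restart at zero
    at the beginning of every stage.
    """
    x = list(x0)
    for eta, beta, num_steps in stages:
        x = sema_adaptive_descent(grad_oracle, x, num_steps, eta=eta, beta=beta)
    return x


# Hypothetical schedule: halve the step size and halve 1 - beta each stage.
stages = [(0.1, 0.9, 500), (0.05, 0.95, 500), (0.025, 0.975, 500)]
x_final = run_stagewise(make_noisy_quadratic_grad(), [5.0, -3.0], stages)
grad_norm = math.sqrt(sum(xi * xi for xi in x_final))  # equals ||grad F(x)|| here
```

On this toy problem the gradient norm at the final iterate is far below its initial value, which only illustrates the qualitative behavior; the rates stated in the paper come from its analysis, not from this sketch.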
