A Unified Analysis of AdaGrad with Weighted Aggregation and Momentum Acceleration
Li Shen, Congliang Chen, Fangyu Zou, Zequn Jie, Ju Sun, Wei Liu
TL;DR
The paper addresses the gap in convergence theory for adaptive SGD methods in the non-convex stochastic setting by introducing AdaUSM, which combines weighted AdaGrad with a unified stochastic momentum scheme that covers both heavy ball and Nesterov momentum. It develops a general weighted, coordinate-wise learning-rate scheme and proves an $\mathcal{O}(\log(T)/\sqrt{T})$ non-asymptotic convergence rate under polynomially growing weights. It also shows that the adaptive learning rates of Adam and RMSProp correspond to specific exponential-weight choices within AdaUSM. The framework unifies existing adaptive methods, provides high-probability guarantees, and yields practical acceleration validated by experiments on multiple deep learning models and datasets. Overall, AdaUSM offers both a theoretical and a practical bridge between momentum-based acceleration and adaptive step-size strategies in non-convex stochastic optimization.
Abstract
Integrating adaptive learning rate and momentum techniques into SGD leads to a large class of efficiently accelerated adaptive stochastic algorithms, such as AdaGrad, RMSProp, Adam, AccAdaGrad, \textit{etc.} Despite their effectiveness in practice, there is still a large gap in their convergence theory, especially in the difficult non-convex stochastic setting. To fill this gap, we propose \emph{weighted AdaGrad with unified momentum}, dubbed AdaUSM, whose main characteristics are that (1) it incorporates a unified momentum scheme that covers both the heavy ball momentum and the Nesterov accelerated gradient momentum; and (2) it adopts a novel weighted adaptive learning rate that unifies the learning rates of AdaGrad, AccAdaGrad, Adam, and RMSProp. Moreover, when we take polynomially growing weights in AdaUSM, we obtain an $\mathcal{O}(\log(T)/\sqrt{T})$ convergence rate in the non-convex stochastic setting. We also show that the adaptive learning rates of Adam and RMSProp correspond to taking exponentially growing weights in AdaUSM, thereby providing a new perspective for understanding Adam and RMSProp. Lastly, we carry out comparative experiments of AdaUSM against SGD with momentum, AdaGrad, AdaEMA, Adam, and AMSGrad on various deep learning models and datasets.
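The claim that Adam-style learning rates arise from exponentially growing weights can be checked numerically. The sketch below is illustrative only and is not the paper's algorithm: it assumes the weighted scheme normalizes by a coordinate-wise weighted average of squared gradients, $v_t = \sum_{i \le t} a_i g_i^2 / \sum_{i \le t} a_i$, and the names `weighted_second_moment` and `adam_second_moment` are hypothetical. Under that assumption, the weights $a_i = \beta^{-i}$ reproduce Adam's bias-corrected second-moment estimate, uniform weights give the mean squared gradient (AdaGrad's accumulator divided by $t$), and $a_i = i^{\alpha}$ is one example of polynomially growing weights.

```python
import numpy as np


def weighted_second_moment(grads, weights):
    """Coordinate-wise weighted average of squared gradients.

    Assumed form (not taken from the paper's algorithm listing):
    v_t = sum_i a_i * g_i**2 / sum_i a_i.
    """
    grads = np.asarray(grads, dtype=float)      # shape (t, d)
    weights = np.asarray(weights, dtype=float)  # shape (t,)
    return (weights[:, None] * grads**2).sum(axis=0) / weights.sum()


def adam_second_moment(grads, beta):
    """Bias-corrected exponential moving average of squared gradients."""
    v = np.zeros_like(grads[0], dtype=float)
    for g in grads:
        v = beta * v + (1.0 - beta) * g**2
    return v / (1.0 - beta ** len(grads))


# Synthetic gradients, used only to compare the two estimators.
rng = np.random.default_rng(0)
grads = rng.normal(size=(50, 3))
beta = 0.9
steps = np.arange(1, len(grads) + 1, dtype=float)

# Exponentially growing weights a_i = beta**(-i).
v_exponential = weighted_second_moment(grads, beta ** (-steps))
v_adam = adam_second_moment(grads, beta)

# Uniform weights: the plain mean of squared gradients.
v_uniform = weighted_second_moment(grads, np.ones_like(steps))

# Polynomially growing weights a_i = i**alpha (alpha chosen arbitrarily).
v_polynomial = weighted_second_moment(grads, steps**2.0)

print(np.allclose(v_exponential, v_adam))
```

The equivalence follows from multiplying numerator and denominator by $\beta^{t}$: the weighted average becomes $(1-\beta)\sum_{i \le t} \beta^{t-i} g_i^2 / (1-\beta^{t})$, which is the bias-corrected moving average computed by `adam_second_moment`.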
