Convergence rates of stochastic gradient method with independent sequences of step-size and momentum weight
Wen-Liang Hwang
TL;DR
This work analyzes the convergence of stochastic gradient methods with Polyak momentum when step-size and momentum sequences are treated independently, under a strongly convex objective on a compact convex domain with bounded subgradients. It derives mean-squared convergence results for two common step-size regimes: diminishing-to-zero steps with $\sum_j t_j=\infty$ and $\sum_j t_j^2<\infty$, and constant-and-drop multi-stage steps, detailing how momentum decay $\eta_j$ affects rates. For diminishing-to-zero steps, the convergence rate factors into an exponential term in the accumulated step-sizes and a polynomial term in the momentum, with precise rates depending on $\eta_j$ (e.g., $O(1/(N+1))$, $O(\log(N)/(N+1))$, or $O((N+1)^{-\beta})$ for specific decays). In the constant-and-drop setting, convergence of stagewise suffix-averaged iterates is guaranteed when $\sum_i \eta_i/(N+1) \to 0$, justifying practical momentum schedules that decay at stage boundaries. Overall, the results clarify when and how independent momentum updates can preserve convergence and support commonly used momentum settings in large-scale learning.
Abstract
In large-scale learning algorithms, a momentum term is usually included in the stochastic sub-gradient method to improve the learning speed because it can navigate ravines efficiently to reach a local minimum. However, the step-size and momentum weight hyper-parameters must be appropriately tuned to optimize convergence. We therefore analyze, within a stochastic programming framework, the convergence rate of Polyak's acceleration under two commonly used step-size schedules: "diminishing-to-zero" and "constant-and-drop" (where the sequence is divided into stages and a constant step-size is applied at each stage), for strongly convex functions over a compact convex set with bounded sub-gradients. For the former, we show that the convergence rate can be written as the product of a term exponential in the step-size and a term polynomial in the momentum weight. Our analysis justifies the convergence of the default momentum weight setting combined with the diminishing-to-zero step-size sequence in large-scale machine learning software. For the latter, we present the condition on the momentum weight sequence under which convergence holds at each stage.
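As an illustration of the setting described above, the following is a minimal sketch of a projected stochastic gradient method with a heavy-ball (Polyak) momentum term, in which the step-size sequence and the momentum weight sequence are supplied independently. The exact update rule analyzed in the paper is not reproduced here; the update form, the quadratic test objective, the Gaussian noise model, the ball-shaped domain, and all function names (`project_ball`, `sgd_polyak`, `sq_dist`) are assumptions chosen for illustration only.

```python
import math
import random


def project_ball(x, radius):
    """Euclidean projection onto a ball (an assumed compact convex domain)."""
    norm = math.sqrt(sum(v * v for v in x))
    if norm <= radius:
        return list(x)
    return [v * radius / norm for v in x]


def sgd_polyak(x0, x_star, n_iters, step, momentum,
               radius=10.0, noise=0.1, seed=0):
    """Projected SGD with heavy-ball momentum (illustrative sketch).

    `step(j)` and `momentum(j)` are independent schedules returning the
    step-size t_j and the momentum weight eta_j at iteration j.
    """
    rng = random.Random(seed)
    x_prev = list(x0)
    x = list(x0)
    for j in range(n_iters):
        # Noisy gradient of the assumed objective f(x) = 0.5 * ||x - x_star||^2,
        # which is strongly convex with modulus 1.
        g = [(xi - si) + noise * rng.gauss(0.0, 1.0)
             for xi, si in zip(x, x_star)]
        t_j, eta_j = step(j), momentum(j)
        # Gradient step plus momentum term eta_j * (x_j - x_{j-1}).
        y = [xi - t_j * gi + eta_j * (xi - pi)
             for xi, gi, pi in zip(x, g, x_prev)]
        x_prev, x = x, project_ball(y, radius)
    return x


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((u - v) ** 2 for u, v in zip(a, b))


x0 = [5.0, -5.0]
x_star = [1.0, 2.0]
# Diminishing-to-zero steps t_j = 1/(j+1): sum t_j diverges, sum t_j^2 converges.
# The momentum decay eta_j = 0.5/(j+1) is one arbitrary choice of schedule.
x_final = sgd_polyak(
    x0, x_star, 5000,
    step=lambda j: 1.0 / (j + 1),
    momentum=lambda j: 0.5 / (j + 1),
)
print(sq_dist(x_final, x_star))
```

Passing a different `momentum` callable (for example a constant weight) changes only the momentum schedule, which is the sense in which the two sequences are treated independently here. This toy run is not evidence for any of the rates stated above.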
