Convergence rates of stochastic gradient method with independent sequences of step-size and momentum weight

Wen-Liang Hwang

TL;DR

This work analyzes the convergence of stochastic gradient methods with Polyak momentum when step-size and momentum sequences are treated independently, under a strongly convex objective on a compact convex domain with bounded subgradients. It derives mean-squared convergence results for two common step-size regimes: diminishing-to-zero steps with $\sum_j t_j=\infty$ and $\sum_j t_j^2<\infty$, and constant-and-drop multi-stage steps, detailing how momentum decay $\eta_j$ affects rates. For diminishing-to-zero steps, the convergence rate factors into an exponential term in the accumulated step-sizes and a polynomial term in the momentum, with precise rates depending on $\eta_j$ (e.g., $O(1/(N+1))$, $O(\log(N)/ (N+1))$, or $O((N+1)^{-\beta})$ for specific decays). In the constant-and-drop setting, convergence of stagewise suffix-averaged iterates is guaranteed when $\sum_i \eta_i/(N+1) \to 0$, justifying practical momentum schedules that decay at stage boundaries. Overall, the results clarify when and how independent momentum updates can preserve convergence and support commonly used momentum settings in large-scale learning.
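The setting summarized above can be illustrated with a minimal sketch. It assumes the usual heavy-ball form of Polyak momentum with a projection onto a ball; the paper's exact update is not reproduced on this page, so the update rule, the quadratic test objective, the noise level, and the names `project_ball`, `sg_momentum`, and `run_demo` are illustrative assumptions rather than the author's method.

```python
import numpy as np

def project_ball(x, radius):
    """Euclidean projection onto the ball {x : ||x|| <= radius}, a compact convex set."""
    norm = np.linalg.norm(x)
    return x if norm <= radius else x * (radius / norm)

def sg_momentum(grad, x0, step, momentum, n_iters, radius, rng):
    """Projected stochastic gradient with heavy-ball momentum.

    Assumed update (not copied from the paper):
        x_{j+1} = Proj(x_j - t_j * g_j + eta_j * (x_j - x_{j-1}))
    where step(j) = t_j and momentum(j) = eta_j are chosen independently.
    """
    x_prev = x = project_ball(np.asarray(x0, dtype=float), radius)
    for j in range(n_iters):
        g = grad(x, rng)  # noisy gradient at the current iterate
        x_next = project_ball(x - step(j) * g + momentum(j) * (x - x_prev), radius)
        x_prev, x = x, x_next
    return x

def run_demo(n_iters=20000, seed=0):
    """Return the squared distance to the minimizer after n_iters steps."""
    rng = np.random.default_rng(seed)
    m = 1.0  # strong convexity parameter of f(x) = (m/2) * ||x - x_star||^2
    x_star = np.array([0.5, -0.25])
    grad = lambda x, rng: m * (x - x_star) + 0.1 * rng.standard_normal(x.shape)
    step = lambda j: (1.0 / m) / (j + 1)   # t_j = gamma / (j + 1) with gamma = 1/m
    momentum = lambda j: 0.9 / (j + 1)     # hypothetical decay for eta_j
    x = sg_momentum(grad, np.zeros(2), step, momentum, n_iters, radius=2.0, rng=rng)
    return float(np.sum((x - x_star) ** 2))
```

Calling `run_demo()` returns the squared error of the last iterate on this toy problem; swapping in a different `momentum` function shows how the momentum decay can be varied without touching the step-size sequence.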

Abstract

In large-scale learning algorithms, a momentum term is usually added to the stochastic sub-gradient method to improve the learning speed, because it can navigate ravines efficiently to reach a local minimum. However, the step-size and momentum weight hyper-parameters must be tuned appropriately to optimize convergence. We therefore analyze the convergence rate of stochastic programming with Polyak's acceleration under two commonly used step-size schedules: "diminishing-to-zero" and "constant-and-drop" (where the sequence is divided into stages and a constant step-size is applied within each stage), for strongly convex functions over a compact convex set with bounded sub-gradients. For the former, we show that the convergence rate can be written as a product of a term exponential in the step-size and a term polynomial in the momentum weight. Our analysis justifies the convergence obtained with the default momentum weight setting and the diminishing-to-zero step-size sequence in large-scale machine learning software. For the latter, we present the condition on the momentum weight sequence under which the method converges at each stage.
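A constant-and-drop sequence can be sketched as below. The stage lengths, the initial value, and the drop factor are hypothetical parameters chosen for illustration; the abstract does not specify how stages are sized or by how much the value drops, and the helper name `constant_and_drop` is not from the paper.

```python
def constant_and_drop(stage_lengths, initial_value, drop_factor):
    """Return one value per iteration.

    The value is held constant within a stage and multiplied by
    drop_factor at each stage boundary (an assumed drop rule).
    """
    values = []
    value = initial_value
    for length in stage_lengths:
        values.extend([value] * length)  # constant inside the stage
        value *= drop_factor             # drop at the stage boundary
    return values
```

For example, `constant_and_drop([2, 2, 1], 1.0, 0.5)` gives `[1.0, 1.0, 0.5, 0.5, 0.25]`. The same helper can generate a momentum weight sequence that decays at stage boundaries, independently of the step-size sequence.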

Paper Structure (12 sections, 5 theorems, 52 equations)

This paper contains 12 sections, 5 theorems, 52 equations.

Introduction
Assumptions and contributions
Related works
Convergence analysis
Diminishing-to-zero step-sizes
Constant-and-drop step-sizes
Conclusions and discussions
Proof of Lemma \ref{SGcon}
Proof of Lemma \ref{sto:rmsthm}
Proof of Theorem \ref{dthm}
Proof of Eq. (\ref{constantM})
Proof of Theorem \ref{piecewisemomlem}

Key Result

Lemma 1

Suppose Assumptions 1-3 hold, and let $m$ be defined according to Eq. (mstrongly). With a diminishing-to-zero step-size sequence satisfying $\sum_i t_i = \infty$ and $\sum_i t_i^2 < \infty$, the mean-squared error of SG under the update in Eq. (sto:stoupdate) converges to zero. Precisely, there exist $j_0$ and $c_0$ ... If the step-size sequence is $t_j = \frac{\gamma}{(j+1)^{\alpha}}$ with $\gamma = \frac{1}{m}$ ...
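As a quick check of which exponents make the polynomial step-size family $t_j = \gamma/(j+1)^{\alpha}$ compatible with the two summability conditions, the standard $p$-series test can be applied. The range of $\alpha$ assumed by the lemma itself is not shown in this excerpt, so the following is only the textbook computation, for any fixed $\gamma > 0$.

```latex
\sum_{j \ge 0} t_j = \gamma \sum_{j \ge 0} \frac{1}{(j+1)^{\alpha}} = \infty
\iff \alpha \le 1,
\qquad
\sum_{j \ge 0} t_j^2 = \gamma^2 \sum_{j \ge 0} \frac{1}{(j+1)^{2\alpha}} < \infty
\iff \alpha > \tfrac{1}{2}.
```

Both conditions therefore hold together exactly when $\tfrac{1}{2} < \alpha \le 1$; the choice $\alpha = 1$ gives the familiar $t_j = \gamma/(j+1)$ schedule.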

Theorems & Definitions (5)

Lemma 1
Lemma 2
Theorem 3
Theorem 4
Corollary 5
