Acceleration of stochastic gradient descent with momentum by averaging: finite-sample rates and asymptotic normality

Kejie Tang; Weidong Liu; Yichen Zhang; Xi Chen

TL;DR

This work analyzes stochastic gradient descent with momentum (SGDM) under strong convexity, establishing nonasymptotic convergence rates for both quadratic and general losses and showing that momentum can accelerate convergence with large mini-batch sizes. It introduces Polyak averaging for SGDM, proving asymptotic normality of the averaged iterates and demonstrating an asymptotic equivalence to averaged SGD, which enables principled uncertainty quantification and statistical inference for model parameters. The results specify optimal learning-rate and momentum choices, reveal improved robustness to hyperparameters, and provide high-probability bounds for general losses. Empirical experiments on quadratic, logistic, and MNIST-scale problems validate the theory and illustrate practical gains and confidence-interval construction for SGDM-based optimization.

Abstract

Stochastic gradient descent with momentum (SGDM) is widely used in machine learning and statistical applications. Despite the observed empirical benefits of SGDM over traditional SGD, the theoretical understanding of the role of momentum for different learning rates in the optimization process remains largely open. We analyze the finite-sample convergence rate of SGDM in the strongly convex setting and show that, with a large batch size, mini-batch SGDM converges faster than mini-batch SGD to a neighborhood of the optimal value. Additionally, our findings, supported by theoretical analysis and numerical experiments, indicate that SGDM permits a broader range of learning rates. Furthermore, we analyze the Polyak-averaged version of the SGDM estimator, establish its asymptotic normality, and justify its asymptotic equivalence to averaged SGD. The asymptotic distribution of averaged SGDM enables uncertainty quantification of the algorithm output and statistical inference for the model parameters.
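
To make the objects in the abstract concrete, the following is a minimal sketch of mini-batch SGDM with Polyak averaging on a toy quadratic loss. It is an illustration under stated assumptions, not the paper's implementation: the normalized momentum update $m_{t+1}=\gamma m_t+(1-\gamma)g_t$, $x_{t+1}=x_t-\alpha m_{t+1}$ is assumed, and the names `averaged_sgdm` and `make_quadratic_grad`, the diagonal Hessian, the Gaussian gradient noise, and all numeric settings are hypothetical choices made for the example.

```python
import random


def averaged_sgdm(grad_fn, x0, alpha, gamma, n_steps, batch_size, rng):
    """Run mini-batch SGDM; return (last iterate, Polyak average of iterates).

    Assumed update (normalized momentum, an assumption of this sketch):
        m_{t+1} = gamma * m_t + (1 - gamma) * g_t
        x_{t+1} = x_t - alpha * m_{t+1}
    """
    d = len(x0)
    x = list(x0)
    m = [0.0] * d
    x_bar = [0.0] * d
    for t in range(1, n_steps + 1):
        # Mini-batch gradient: mean of batch_size noisy gradients at x.
        g = [0.0] * d
        for _ in range(batch_size):
            gi = grad_fn(x, rng)
            for k in range(d):
                g[k] += gi[k] / batch_size
        for k in range(d):
            m[k] = gamma * m[k] + (1.0 - gamma) * g[k]
            x[k] -= alpha * m[k]
            # Running mean of the iterates produced so far.
            x_bar[k] += (x[k] - x_bar[k]) / t
    return x, x_bar


def make_quadratic_grad(diag, x_star, noise_std):
    """Noisy gradient of f(x) = 0.5 * sum_k diag[k] * (x[k] - x_star[k])**2."""
    def grad_fn(x, rng):
        return [a * (xk - sk) + rng.gauss(0.0, noise_std)
                for a, xk, sk in zip(diag, x, x_star)]
    return grad_fn


# Hypothetical toy problem: condition number L / mu = 5, unit-variance noise.
rng = random.Random(0)
x_star = [1.0, -2.0]
grad_fn = make_quadratic_grad(diag=[1.0, 5.0], x_star=x_star, noise_std=1.0)
x_last, x_avg = averaged_sgdm(grad_fn, x0=[0.0, 0.0], alpha=0.5, gamma=0.9,
                              n_steps=5000, batch_size=4, rng=rng)
print("last iterate:", x_last)
print("Polyak average:", x_avg)
```

In this sketch the last iterate keeps fluctuating in a neighborhood of the minimizer, while the running average settles much closer to it, which is the qualitative behavior the averaging results describe.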

Paper Structure (29 sections, 21 theorems, 156 equations, 13 figures)

Introduction
Related Works
Preliminaries
Finite-sample Convergence Rates for SGDM
The finite-sample rates for SGDM on quadratic losses
Linear convergence to a local neighborhood
Explicit convergence rates for SGDM with specified $\gamma$ and $\alpha$
Comparison to SGD.
Comparison to existing results for SGDM.
The finite-sample rates for SGDM on general losses
Acceleration by Averaging and Asymptotic Normality
Averaged SGDM under quadratic losses
Averaged SGDM under general losses
Experiments
Simulation: quadratic loss
...and 14 more sections

Key Result

Theorem 1

Under (A1)-(A3) and $\overline{L}=0$, for any momentum $\gamma\in[0,1)$ and fixed $\delta\in(0,1]$, assume the learning rate $\alpha>0$ satisfies $\alpha L<2(1+\gamma)/(1-\gamma)$ and $16 M^2 \alpha^2 L_f^2\leq B\delta\lambda^{2(1-\delta)}(1-\lambda)$, where $\lambda$ is the spectral radius of the matrix ... Let $\widetilde{m}_{t+1} =(1-\gamma) \sum_{j=1}^{t}\gamma^{t-j} \Sigma (x_j-x^*)$; then we have ...
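
The weighted sum defining $\widetilde{m}_{t+1}$ has the form obtained by unrolling an exponentially weighted momentum recursion. As a sketch, assume the recursion $\widetilde{m}_{t+1}=\gamma\widetilde{m}_{t}+(1-\gamma)\Sigma(x_t-x^*)$ with $\widetilde{m}_{1}=0$ (this recursion and initialization are assumptions for illustration and are not stated in the excerpt above). Then:

```latex
\begin{align*}
\widetilde{m}_{2} &= (1-\gamma)\,\Sigma(x_1-x^*), \\
\widetilde{m}_{t+1}
  &= \gamma\,\widetilde{m}_{t} + (1-\gamma)\,\Sigma(x_t-x^*) \\
  &= \gamma\,(1-\gamma)\sum_{j=1}^{t-1}\gamma^{t-1-j}\,\Sigma(x_j-x^*)
     + (1-\gamma)\,\Sigma(x_t-x^*) \\
  &= (1-\gamma)\sum_{j=1}^{t}\gamma^{t-j}\,\Sigma(x_j-x^*).
\end{align*}
```

Under this reading, the weights $(1-\gamma)\gamma^{t-j}$ sum to $1-\gamma^{t}$, so $\widetilde{m}_{t+1}$ is close to a weighted average of the past terms $\Sigma(x_j-x^*)$ once $t$ is moderately large.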

Figures (13)

Figure 1: The spectral radius of $\Gamma$ with respect to $\gamma$ and $\alpha$ in Theorem \ref{thm:lambda}, where $L/\mu=5/1$.
Figure 2: Performance of SGD and SGDM on the quadratic loss. The average value of the adaptive momentum weight is 0.96.
Figure 3: Performance of Averaged SGD and Averaged SGDM on the quadratic loss. The average value of the adaptive momentum weight is 0.96.
Figure 4: Performance of Averaged SGD and Averaged SGDM on the quadratic loss with $n_0=500$.
Figure 5: Frequency of $Y$ about Averaged SGD and Averaged SGDM with $\gamma=0.9$.
...and 8 more figures

Theorems & Definitions (41)

Example 1
Theorem 1
Remark 2
Theorem 3
Remark 4
Remark 5
Corollary 6: Small momentum weight $\gamma$
Corollary 7: Large momentum weight $\gamma$
Remark 8
Theorem 9
...and 31 more
