Acceleration of stochastic gradient descent with momentum by averaging: finite-sample rates and asymptotic normality

Kejie Tang, Weidong Liu, Yichen Zhang, Xi Chen

TL;DR

This work analyzes stochastic gradient descent with momentum (SGDM) under strong convexity, establishing nonasymptotic convergence rates for both quadratic and general losses and showing that momentum can accelerate convergence with large mini-batch sizes. It introduces Polyak averaging for SGDM, proving asymptotic normality of the averaged iterates and demonstrating an asymptotic equivalence to averaged SGD, which enables principled uncertainty quantification and statistical inference for model parameters. The results specify optimal learning-rate and momentum choices, reveal improved robustness to hyperparameters, and provide high-probability bounds for general losses. Empirical experiments on quadratic, logistic, and MNIST-scale problems validate the theory and illustrate practical gains and confidence-interval construction for SGDM-based optimization.

Abstract

Stochastic gradient descent with momentum (SGDM) has been widely used in many machine learning and statistical applications. Despite the observed empirical benefits of SGDM over traditional SGD, the theoretical understanding of the role of momentum for different learning rates in the optimization process remains largely open. We analyze the finite-sample convergence rate of SGDM in the strongly convex setting and show that, with a large batch size, mini-batch SGDM converges faster than mini-batch SGD to a neighborhood of the optimal value. Additionally, our findings, supported by theoretical analysis and numerical experiments, indicate that SGDM permits broader choices of learning rates. Furthermore, we analyze the Polyak-averaging version of the SGDM estimator, establish its asymptotic normality, and justify its asymptotic equivalence to the averaged SGD. The asymptotic distribution of the averaged SGDM enables uncertainty quantification of the algorithm output and statistical inference of the model parameters.
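
To make the objects in the abstract concrete, the following is a minimal sketch of mini-batch SGDM with Polyak averaging on a noisy diagonal quadratic. It is an illustration under stated assumptions, not the authors' implementation: the normalized momentum update, the function names `averaged_sgdm` and `make_quadratic_grad`, the `burn_in` parameter, the additive Gaussian noise model, and all numeric values are hypothetical choices that do not come from the paper.

```python
import random


def averaged_sgdm(grad, x0, alpha, gamma, n_steps, burn_in=0):
    """Run SGDM and return (last iterate, Polyak average of the iterates).

    Assumed update (an illustrative choice, not confirmed by this excerpt):
        m_t = gamma * m_{t-1} + (1 - gamma) * g_t
        x_t = x_{t-1} - alpha * m_t
    where g_t = grad(x_{t-1}) is a stochastic (mini-batch) gradient.
    """
    x = list(x0)
    m = [0.0] * len(x)
    avg = [0.0] * len(x)
    count = 0
    for t in range(n_steps):
        g = grad(x)
        m = [gamma * mi + (1.0 - gamma) * gi for mi, gi in zip(m, g)]
        x = [xi - alpha * mi for xi, mi in zip(x, m)]
        if t >= burn_in:
            # Running mean of the iterates after the burn-in period.
            count += 1
            avg = [ai + (xi - ai) / count for ai, xi in zip(avg, x)]
    return x, avg


def make_quadratic_grad(curvatures, x_star, noise_std, batch_size, rng):
    """Noisy gradient of f(x) = 0.5 * sum_i c_i * (x_i - x*_i)^2.

    Hypothetical noise model: averaging batch_size i.i.d. Gaussian
    perturbations shrinks the noise standard deviation by sqrt(batch_size).
    """
    scale = noise_std / batch_size ** 0.5

    def grad(x):
        return [c * (xi - si) + rng.gauss(0.0, scale)
                for c, xi, si in zip(curvatures, x, x_star)]

    return grad


# Usage on a 2-D quadratic with condition number 5 (values are arbitrary).
rng = random.Random(0)
grad = make_quadratic_grad([1.0, 5.0], [1.0, -2.0],
                           noise_std=1.0, batch_size=16, rng=rng)
x_last, x_bar = averaged_sgdm(grad, [0.0, 0.0], alpha=0.1, gamma=0.9,
                              n_steps=5000, burn_in=500)
print("last iterate:  ", x_last)
print("Polyak average:", x_bar)
```

In this sketch the last iterate keeps fluctuating around the minimizer because the step size is constant, while the running average smooths those fluctuations. The averaged iterate is the quantity whose asymptotic normality the abstract refers to.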

Paper Structure

This paper contains 29 sections, 21 theorems, 156 equations, and 13 figures.

Key Result

Theorem 1

Under (A1)-(A3) and $\overline{L}=0$, for any momentum $\gamma\in[0,1)$ and fixed $\delta\in(0,1]$, assume the learning rate $\alpha>0$ satisfies $\alpha L<2(1+\gamma)/(1-\gamma)$ and $16 M^2 \alpha^2 L_f^2\leq B\delta\lambda^{2(1-\delta)}(1-\lambda)$, where [...] and $\lambda$ is the spectral radius of the matrix [...]. Let $\widetilde{m}_{t+1} =(1-\gamma) \sum_{j=1}^{t}\gamma^{t-j} \Sigma (x_j-x^*)$; then we have [...]
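
Two pieces of the statement above can be checked numerically. The sketch below tests the learning-rate condition $\alpha L<2(1+\gamma)/(1-\gamma)$, and verifies that the weighted sum defining $\widetilde{m}_{t+1}$ agrees with the recursion $\widetilde{m}_{t+1}=\gamma\widetilde{m}_{t}+(1-\gamma)\Sigma(x_t-x^*)$ started from zero, which follows by unrolling the sum. The function names are hypothetical, and $\Sigma$ is treated as a scalar `sigma` purely for illustration; the second condition of the theorem is not checked because its constants are not defined in this excerpt.

```python
def step_size_ok(alpha, L, gamma):
    """Check the stated condition alpha * L < 2 * (1 + gamma) / (1 - gamma)."""
    if alpha <= 0.0 or not (0.0 <= gamma < 1.0):
        raise ValueError("need alpha > 0 and gamma in [0, 1)")
    return alpha * L < 2.0 * (1.0 + gamma) / (1.0 - gamma)


def m_tilde_direct(errors, sigma, gamma):
    """Evaluate (1 - gamma) * sum_{j=1}^t gamma^(t-j) * sigma * e_j directly.

    errors[j-1] plays the role of x_j - x^* (scalar case, for illustration).
    """
    t = len(errors)
    return (1.0 - gamma) * sum(gamma ** (t - j) * sigma * e
                               for j, e in enumerate(errors, start=1))


def m_tilde_recursive(errors, sigma, gamma):
    """Same quantity via m <- gamma * m + (1 - gamma) * sigma * e, from m = 0."""
    m = 0.0
    for e in errors:
        m = gamma * m + (1.0 - gamma) * sigma * e
    return m


# Usage with arbitrary illustrative values.
errs = [1.0, 0.5, -0.25, 0.125]
print(step_size_ok(alpha=10.0, L=2.0, gamma=0.9))
print(m_tilde_direct(errs, sigma=2.0, gamma=0.9))
print(m_tilde_recursive(errs, sigma=2.0, gamma=0.9))
```

With $\gamma=0$ the condition reduces to $\alpha L<2$, the familiar step-size range for gradient descent on an $L$-smooth quadratic. As $\gamma$ approaches 1 the admissible range of $\alpha$ widens, which is consistent with the abstract's claim that SGDM permits broader choices of learning rates.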

Figures (13)

  • Figure 1: The spectral radius of $\Gamma$ with respect to $\gamma$ and $\alpha$ in Theorem \ref{thm:lambda}, where $L/\mu=5/1$.
  • Figure 2: Performance of SGD and SGDM on the quadratic loss. The average value of the adaptive momentum weight is 0.96.
  • Figure 3: Performance of Averaged SGD and Averaged SGDM on the quadratic loss. The average value of the adaptive momentum weight is 0.96.
  • Figure 4: Performance of Averaged SGD and Averaged SGDM on the quadratic loss with $n_0=500$.
  • Figure 5: Frequency of $Y$ for Averaged SGD and Averaged SGDM with $\gamma=0.9$.
  • ...and 8 more figures

Theorems & Definitions (41)

  • Example 1
  • Theorem 1
  • Remark 2
  • Theorem 3
  • Remark 4
  • Remark 5
  • Corollary 6: Small momentum weight $\gamma$
  • Corollary 7: Large momentum weight $\gamma$
  • Remark 8
  • Theorem 9
  • ...and 31 more