Unified Convergence Analysis of Stochastic Momentum Methods for Convex and Non-convex Optimization

Tianbao Yang, Qihang Lin, Zhe Li

TL;DR

The paper addresses the lack of convergence theory for stochastic momentum methods in convex and non-convex optimization by introducing a Stochastic Unified Momentum (SUM) framework that contains the stochastic heavy-ball method (SHB), the stochastic variant of Nesterov's accelerated gradient method (SNAG), and the stochastic gradient method as special cases. It provides basic convergence guarantees in both settings, showing $O(1/\sqrt{t})$ rates under standard assumptions and clarifying how the momentum parameter $\beta$ and the interpolation parameter $s$ affect the constants in the bounds. The analysis is complemented by experiments on CIFAR-10/100 in which SNAG often achieves a favorable balance between training speed and testing stability, consistent with the theoretical insights. Overall, the work offers a theoretical baseline and practical guidance for designing stochastic momentum methods in large-scale learning.

Abstract

Recently, stochastic momentum methods have been widely adopted in training deep neural networks. However, their convergence analysis is still underexplored, in particular for non-convex optimization. This paper fills the gap between practice and theory by developing a basic convergence analysis of two stochastic momentum methods, namely the stochastic heavy-ball method and the stochastic variant of Nesterov's accelerated gradient method. We hope that the basic convergence results developed in this paper can serve as a reference for the convergence of stochastic momentum methods and as baselines for comparison in the future development of such methods. The novelty of the convergence analysis presented in this paper is a unified framework, which reveals more insights about the similarities and differences between different stochastic momentum methods and the stochastic gradient method. The unified framework exhibits a continuous change from the gradient method to Nesterov's accelerated gradient method and finally to the heavy-ball method, induced by a free parameter, which can help explain a similar change observed in the testing error convergence behavior for deep learning. Furthermore, our empirical results for optimizing deep neural networks demonstrate that the stochastic variant of Nesterov's accelerated gradient method achieves a good tradeoff (between speed of convergence in training error and robustness of convergence in testing error) among the three stochastic methods.
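The "continuous change" induced by a free parameter can be made concrete with a small sketch. The update rule itself is not reproduced on this page, so the form below is an assumption recalled from the paper's unified formulation and should be checked against the paper: $\mathbf{y}_{k+1} = \mathbf{x}_k - \alpha \mathbf{g}_k$, $\mathbf{y}^s_{k+1} = \mathbf{x}_k - s\alpha \mathbf{g}_k$, $\mathbf{x}_{k+1} = \mathbf{y}_{k+1} + \beta(\mathbf{y}^s_{k+1} - \mathbf{y}^s_k)$, with $\mathbf{y}^s_0 = \mathbf{x}_0$. Under that assumption, $s = 0$ gives the heavy-ball method, $s = 1$ gives Nesterov's method, and $s = 1/(1-\beta)$ reduces to a plain gradient step with step size $\alpha/(1-\beta)$. The names `sum_step`, `run_sum`, and `quad_grad` are hypothetical, and the gradient here is deterministic for simplicity.

```python
from typing import Callable, List, Tuple

Vector = List[float]


def sum_step(
    x: Vector,
    y_s_prev: Vector,
    grad: Callable[[Vector], Vector],
    alpha: float,
    beta: float,
    s: float,
) -> Tuple[Vector, Vector]:
    """One iteration of the assumed unified momentum update.

    Returns the next iterate x_{k+1} and the auxiliary point y^s_{k+1}.
    """
    g = grad(x)
    # Plain gradient step from x_k.
    y_next = [xi - alpha * gi for xi, gi in zip(x, g)]
    # Gradient step from x_k with the step size scaled by s.
    y_s_next = [xi - s * alpha * gi for xi, gi in zip(x, g)]
    # Momentum is applied to the change in the auxiliary sequence.
    x_next = [
        yn + beta * (ysn - ysp)
        for yn, ysn, ysp in zip(y_next, y_s_next, y_s_prev)
    ]
    return x_next, y_s_next


def run_sum(
    x0: Vector,
    grad: Callable[[Vector], Vector],
    alpha: float,
    beta: float,
    s: float,
    steps: int,
) -> Vector:
    """Run the assumed update for `steps` iterations, starting with y^s_0 = x_0."""
    x, y_s = list(x0), list(x0)
    for _ in range(steps):
        x, y_s = sum_step(x, y_s, grad, alpha, beta, s)
    return x


def quad_grad(x: Vector) -> Vector:
    """Gradient of the toy objective f(x) = 0.5 * ||x||^2 (an illustrative choice)."""
    return list(x)


# s = 0 -> heavy-ball, s = 1 -> Nesterov, s = 1 / (1 - beta) -> gradient method.
hb_iterate = run_sum([1.0], quad_grad, alpha=0.1, beta=0.5, s=0.0, steps=2)
gd_iterate = run_sum([1.0], quad_grad, alpha=0.1, beta=0.5, s=2.0, steps=5)
```

On this toy quadratic, the $s = 1/(1-\beta)$ run shrinks the iterate by the factor $1 - \alpha/(1-\beta)$ at every step, which is exactly what a gradient method with the enlarged step size would do; replacing `quad_grad` with a noisy oracle would give the stochastic variants.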

Paper Structure

This paper contains 14 sections, 8 theorems, 57 equations, 5 figures.

Key Result

Theorem 1

(Convergence of SUM) Suppose $f(\mathbf{x})$ is a convex function, $\mathrm{E}[\|\mathcal{G}(\mathbf{x}; \xi) - \mathrm{E}[\mathcal{G}(\mathbf{x}; \xi)]\|^2]\leq \delta^2$ and $\|\partial f(\mathbf{x})\|\leq G$ for any $\mathbf{x}$. Let update (eqn:um) run for $t$ iterations with $\mathcal{G}$ [...] where $C$ is a positive constant, $\widehat{\mathbf{x}}_{t} = \sum_{k=0}^t\mathbf{x}_k/(t+1)$ and [...]
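The averaged iterate $\widehat{\mathbf{x}}_{t} = \sum_{k=0}^t\mathbf{x}_k/(t+1)$ in the theorem can be maintained incrementally, without storing every $\mathbf{x}_k$. A minimal sketch follows; the helper name `averaged_iterate` is hypothetical, and the code illustrates only the averaging, not the theorem's bound.

```python
from typing import Iterable, List, Optional

Vector = List[float]


def averaged_iterate(iterates: Iterable[Vector]) -> Vector:
    """Return the uniform average of x_0, ..., x_t using a running mean."""
    avg: Optional[Vector] = None
    for k, x in enumerate(iterates):
        if avg is None:
            # The average of the single point x_0 is x_0 itself.
            avg = list(x)
        else:
            # Running-mean update: avg_k = avg_{k-1} + (x_k - avg_{k-1}) / (k + 1).
            avg = [a + (xi - a) / (k + 1) for a, xi in zip(avg, x)]
    if avg is None:
        raise ValueError("at least one iterate is required")
    return avg


x_hat = averaged_iterate([[1.0, 0.0], [2.0, 2.0], [3.0, 4.0]])
```

The running-mean form keeps memory constant in $t$, which matters when the iterates are large parameter vectors.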

Figures (5)

  • Figure 1: Training and testing error of different methods with the best initial step size on CIFAR-10.
  • Figure 2: Training and testing error of different methods with the same initial step size on CIFAR-10.
  • Figure 3: Training and testing error of different methods with the initial step size $0.001$ on CIFAR-100.
  • Figure 5: Training and testing error of SNAG with different initial step sizes on CIFAR-100.
  • Figure 6: Training and testing error of SUM with different $s$ on CIFAR-100.

Theorems & Definitions (8)

  • Theorem 1
  • Theorem 2
  • Theorem 3
  • Theorem 4
  • Lemma 1
  • Lemma 2
  • Lemma 3
  • Lemma 4