Generalized Stochastic Gradient Descent with Momentum Methods for Smooth Optimization

Zimeng Wang; Alp Yurtsever

TL;DR

This paper introduces a generalized SGDM framework that unifies a broad class of momentum-based methods, including SGD with Polyak's momentum and SGD with Nesterov's momentum, and yields convergence guarantees for many existing momentum methods as special cases.

Abstract

Stochastic gradient descent with momentum (SGDM) methods have become fundamental optimization tools in machine learning, combining the computational efficiency of stochastic gradients with the acceleration benefits of momentum. Despite their widespread use in practice, the theoretical understanding of SGDM remains incomplete, with most existing analyses focusing on specific momentum schemes or requiring restrictive assumptions. In this paper, we introduce a generalized SGDM framework that unifies a broad class of momentum-based methods, including SGD with Polyak's momentum, SGD with Nesterov's momentum, and many others. We provide comprehensive convergence analyses for both convex and nonconvex optimization problems under mild smoothness and bounded variance assumptions. For convex problems, we establish general ergodic convergence results with constant parameters and derive improved iterate convergence rates with time-varying parameters. For nonconvex problems, we prove sublinear convergence to stationary points and establish linear convergence to a neighborhood of the optimum under the Polyak--Łojasiewicz condition. Notably, our analysis allows flexible parameter choices and thus provides convergence guarantees for many existing momentum methods as special cases.

Paper Structure (11 sections, 10 theorems, 133 equations, 3 figures)

Introduction
Preliminary
Generalized SGDM Method
Convergence for Convex Problems
Convergence Results with Constant Parameters
Improved Convergence Results with Time-varying Parameters
Convergence for Nonconvex Problems
Experiments
Logistic Regression
Image Classification
Conclusion

Key Result

Proposition 1

The generalized SGDM iterative scheme encompasses momentum-based methods such as SGD with Polyak's momentum and SGD with Nesterov's momentum when the parameters $\{\beta_k,\gamma_k,\eta_k\}_{k\ge 1}$ are specified accordingly.
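
This summary does not reproduce the paper's update rule, so the sketch below is only an illustration of how a single three-parameter momentum iteration can recover several methods. The specific form used here (a momentum buffer with weight `beta`, plus separate step sizes `gamma` on the buffer and `eta` on the raw gradient) is an assumption and may differ from the authors' parameterization. The function name `momentum_sgd` and the quadratic test objective are hypothetical.

```python
import numpy as np

def momentum_sgd(grad, x0, beta, gamma, eta, steps):
    """Assumed two-step-size momentum iteration (illustrative, not the paper's scheme).

    m_k     = beta * m_{k-1} + g_k
    x_{k+1} = x_k - gamma * m_k - eta * g_k
    """
    x = np.asarray(x0, dtype=float).copy()
    m = np.zeros_like(x)
    for _ in range(steps):
        g = grad(x)                   # (stochastic) gradient at the current iterate
        m = beta * m + g              # momentum buffer
        x = x - gamma * m - eta * g   # step along the buffer and the raw gradient
    return x

def quad_grad(x):
    # Gradient of the toy objective f(x) = 0.5 * ||x||^2 (hypothetical test problem).
    return x

x0 = np.array([1.0, -2.0])

# eta = 0 gives the heavy-ball (Polyak) recursion
# x_{k+1} = x_k - gamma * g_k + beta * (x_k - x_{k-1}).
polyak = momentum_sgd(quad_grad, x0, beta=0.9, gamma=0.1, eta=0.0, steps=200)

# gamma = s * beta and eta = s give the common "g + beta * m" form of
# Nesterov's momentum with step size s (here s = 0.1).
nesterov = momentum_sgd(quad_grad, x0, beta=0.9, gamma=0.09, eta=0.1, steps=200)
```

Under this assumed form, setting `beta = 0` and `eta = 0` reduces the iteration to plain SGD with step size `gamma`, which is a quick sanity check on any implementation.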

Figures (3)

Figure 1: Evolution of optimality gap on the binary logistic regression problem in both full-batch (deterministic) and mini-batch (stochastic) settings.
Figure 2: Comparison between NAG-const and SGDM-varying on the binary logistic regression problem with different step size $\gamma>0$. For better interpretability, we display the training loss curves in the early stage (first 1000 iterations) and the late stage (last 10000 iterations).
Figure 3: Training ResNet-18 on CIFAR-10 using the generalized SGDM algorithm with different combinations of constant step sizes $(\gamma, \eta)$ with a fixed momentum parameter $\beta = 0.9$.

Theorems & Definitions (24)

Definition 1: Convexity, Lipschitz continuity, and Smoothness
Proposition 1
Proof of Proposition 1
Lemma 1
Proof of Lemma 1
Theorem 1
Proof of Theorem 1
Theorem 2: Deterministic Case
Proof of Theorem 2
Remark 1
...and 14 more
