Generalized Stochastic Gradient Descent with Momentum Methods for Smooth Optimization

Zimeng Wang, Alp Yurtsever

TL;DR

This paper introduces a generalized SGDM framework that unifies a broad class of momentum-based methods, including SGD with Polyak's momentum and SGD with Nesterov's momentum, and derives convergence guarantees for many existing momentum methods as special cases.

Abstract

Stochastic gradient descent with momentum (SGDM) methods have become fundamental optimization tools in machine learning, combining the computational efficiency of stochastic gradients with the acceleration benefits of momentum. Despite their widespread use in practice, the theoretical understanding of SGDM remains incomplete, with most existing analyses focusing on specific momentum schemes or requiring restrictive assumptions. In this paper, we introduce a generalized SGDM framework that unifies a broad class of momentum-based methods, including SGD with Polyak's momentum, SGD with Nesterov's momentum, and many others. We provide comprehensive convergence analyses for both convex and nonconvex optimization problems under mild smoothness and bounded variance assumptions. For convex problems, we establish general ergodic convergence results with constant parameters and derive improved iterate convergence rates with time-varying parameters. For nonconvex problems, we prove sublinear convergence to stationary points and establish linear convergence to a neighborhood of the optimum under the Polyak–Łojasiewicz condition. Notably, our analysis allows flexible parameter choices and thus provides convergence guarantees for many existing momentum methods as special cases.
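
For orientation, the assumptions named in the abstract are usually written as follows. This is a sketch in standard textbook form, not a quotation from the paper: the symbols $L$, $\sigma$, $\mu$, $g$, and $f^\star$ are assumed notation, and the paper's own statements may differ in detail.

```latex
% Standard textbook forms; the notation here is an assumption, not the paper's.
\begin{align*}
&\text{($L$-smoothness)} && \|\nabla f(x) - \nabla f(y)\| \le L\,\|x - y\| \quad \text{for all } x, y, \\
&\text{(unbiased, bounded variance)} && \mathbb{E}[g(x)] = \nabla f(x), \quad \mathbb{E}\,\|g(x) - \nabla f(x)\|^2 \le \sigma^2, \\
&\text{(Polyak--\L{}ojasiewicz)} && \tfrac{1}{2}\,\|\nabla f(x)\|^2 \ge \mu\,\bigl(f(x) - f^\star\bigr) \quad \text{for all } x.
\end{align*}
```

Here $g(x)$ denotes a stochastic gradient and $f^\star$ the minimum value of $f$. With constant step sizes the noise level $\sigma^2$ generally prevents exact convergence, which is why guarantees under the Polyak–Łojasiewicz condition are typically stated as linear convergence to a neighborhood of the optimum.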

Paper Structure

This paper contains 11 sections, 10 theorems, 133 equations, 3 figures.

Key Result

Proposition 1

The generalized SGDM iterative scheme encompasses the following momentum-based methods by specifying the parameters $\{\beta_k,\gamma_k,\eta_k\}_{k\ge 1}$ accordingly.
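
The generalized scheme and the list of recovered methods are not reproduced in this summary. As a rough illustration only, the two methods named in the abstract can be sketched in their common textbook forms. The function names, the toy objective $f(x) = x^2/2$, and the single `step` and `beta` parameters below are assumptions made for this example; they are not the paper's $\{\beta_k,\gamma_k,\eta_k\}$ parametrization.

```python
import random


def sgd_polyak(grad, x0, step, beta, iters):
    """SGD with Polyak's (heavy-ball) momentum, textbook form:
    x_{k+1} = x_k - step * g(x_k) + beta * (x_k - x_{k-1})."""
    x_prev, x = x0, x0
    for _ in range(iters):
        x_next = x - step * grad(x) + beta * (x - x_prev)
        x_prev, x = x, x_next
    return x


def sgd_nesterov(grad, x0, step, beta, iters):
    """SGD with Nesterov's momentum, textbook form:
    y_k = x_k + beta * (x_k - x_{k-1});  x_{k+1} = y_k - step * g(y_k)."""
    x_prev, x = x0, x0
    for _ in range(iters):
        y = x + beta * (x - x_prev)   # extrapolate first
        x_next = y - step * grad(y)   # then take the gradient step at y
        x_prev, x = x, x_next
    return x


def make_noisy_grad(sigma, seed=0):
    """Hypothetical stochastic gradient oracle for f(x) = 0.5 * x**2:
    the exact gradient x plus zero-mean Gaussian noise of std sigma."""
    rng = random.Random(seed)
    return lambda x: x + sigma * rng.gauss(0.0, 1.0)


if __name__ == "__main__":
    for name, method in [("polyak", sgd_polyak), ("nesterov", sgd_nesterov)]:
        exact = method(make_noisy_grad(0.0), 5.0, 0.1, 0.9, 2000)
        noisy = method(make_noisy_grad(0.1), 5.0, 0.1, 0.9, 2000)
        print(name, "deterministic:", exact, "stochastic:", noisy)
```

The two updates differ only in where the gradient is evaluated: at the current iterate for Polyak's momentum, and at the extrapolated point for Nesterov's. Setting `beta = 0` reduces both to plain SGD. A unified analysis treats such variants as parameter choices within one scheme.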

Figures (3)

  • Figure 1: Evolution of optimality gap on the binary logistic regression problem in both full-batch (deterministic) and mini-batch (stochastic) settings.
  • Figure 2: Comparison between NAG-const and SGDM-varying on the binary logistic regression problem with different step size $\gamma>0$. For better interpretability, we display the training loss curves in the early stage (first 1000 iterations) and the late stage (last 10000 iterations).
  • Figure 3: Training ResNet-18 on CIFAR-10 using the generalized SGDM algorithm with different combinations of constant step sizes $(\gamma, \eta)$ with a fixed momentum parameter $\beta = 0.9$.

Theorems & Definitions (24)

  • Definition 1: Convexity, Lipschitz continuity, and Smoothness
  • Proposition 1
  • proof : Proof of Proposition 1
  • Lemma 1
  • proof : Proof of Lemma 1
  • Theorem 1
  • proof : Proof of Theorem 1
  • Theorem 2: Deterministic Case
  • proof : Proof of Theorem 2
  • Remark 1
  • ...and 14 more