Non-convex Stochastic Composite Optimization with Polyak Momentum

Yuan Gao, Anton Rodomanov, Sebastian U. Stich

TL;DR

This work studies non-convex stochastic composite optimization for $F(\boldsymbol{x})=f(\boldsymbol{x})+\psi(\boldsymbol{x})$ and shows that the vanilla stochastic proximal gradient (SPG) method cannot converge beyond the gradient-noise floor without large batches. It introduces a Polyak momentum variant of the proximal method, establishes an optimal $O(\varepsilon^{-2})$ convergence rate independent of batch size, and provides a variance-reduction interpretation of momentum in the composite setting. The analysis covers exact and inexact proximal steps, using a Lyapunov framework that yields descent bounds and explicit iteration counts, and is complemented by experiments on a synthetic quadratic problem and on CIFAR-10 with regularization. The results highlight momentum as a robust mechanism for mitigating gradient noise, enabling practical small-batch optimization in non-convex regimes with broader applicability to ML tasks.

Abstract

The stochastic proximal gradient method is a powerful generalization of the widely used stochastic gradient descent (SGD) method and has found numerous applications in machine learning. However, it is well known that this method fails to converge in non-convex settings where the stochastic noise is significant (i.e., when only small or bounded batch sizes are used). In this paper, we focus on the stochastic proximal gradient method with Polyak momentum. We prove that this method attains an optimal convergence rate for non-convex composite optimization problems, regardless of batch size. Additionally, we rigorously analyze the variance reduction effect of Polyak momentum in the composite optimization setting, and we show that the method also converges when the proximal step can only be solved inexactly. Finally, we provide numerical experiments to validate our theoretical results.
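As a rough illustration of the method described above, the following is a minimal sketch of a proximal gradient step with Polyak momentum. It assumes the common form in which stochastic gradients are averaged into a momentum buffer $m_k = (1-\gamma)\,m_{k-1} + \gamma\, g_k$ and the proximal step is then taken with $m_k$ in place of the raw gradient. The function names (`soft_threshold`, `prox_momentum_step`, `run`), the choice $\psi(x) = \lambda\lVert x\rVert_1$, the constant parameters `gamma` and `M`, and the initialization of the buffer are illustrative assumptions and are not taken from the paper.

```python
import numpy as np


def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1 (an illustrative choice of psi)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)


def prox_momentum_step(x, m, g, gamma, M, lam):
    """One step: average the stochastic gradient, then take a proximal step.

    m_new = (1 - gamma) * m + gamma * g
    x_new = argmin_y <m_new, y> + lam * ||y||_1 + (M / 2) * ||y - x||^2
    For the l1 regularizer the argmin has the closed form below.
    """
    m_new = (1.0 - gamma) * m + gamma * g
    x_new = soft_threshold(x - m_new / M, lam / M)
    return x_new, m_new


def run(x0, stoch_grad, steps, gamma, M, lam, rng=None):
    """Run the sketch for a fixed number of steps.

    stoch_grad(x, rng) is a hypothetical user-supplied gradient oracle.
    The buffer is initialized with one stochastic gradient (an assumption).
    """
    x = np.asarray(x0, dtype=float)
    m = stoch_grad(x, rng)
    for _ in range(steps):
        g = stoch_grad(x, rng)
        x, m = prox_momentum_step(x, m, g, gamma, M, lam)
    return x
```

Setting `gamma = 1` discards the buffer and recovers the vanilla stochastic proximal gradient step, so the same function can be used to compare the two methods. A noisy oracle for the quadratic $f(x)=\frac{L}{2}\lVert x\rVert^2$ could be written, for example, as `lambda x, rng: L * x + sigma * rng.standard_normal(x.shape)` with `rng = np.random.default_rng(0)`.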

Paper Structure

This paper contains 22 sections, 21 theorems, 74 equations, 3 figures, 1 algorithm.

Key Result

Proposition 1

For any $K\geq 1$ and any (predefined) stepsize coefficients $\{M_k\}_{k=0}^{K - 1}$ (possibly depending on the problem parameters $L$, $\sigma^2$, and $K$), there exists a problem instance of the composite problem $F(\mathbf{x})=f(\mathbf{x})+\psi(\mathbf{x})$ with $f(\mathbf{x})\coloneqq \frac{L}{2}\left\lVert\mathbf{x}\right\rVert^2$ and $\psi(\mathbf{x})$ …

Figures (3)

  • Figure 1: Comparison of Algorithm 1 and the vanilla stochastic proximal gradient method on the synthetic quadratic problem. For the vanilla stochastic proximal gradient method, we also overlay smoothed curves on the original curves, which oscillate much more. The left, middle, and right figures correspond to $\sigma=5,25,125$, respectively. The vanilla stochastic proximal gradient method uses batch sizes $1, 16, 64$. The x-axis represents the number of gradient samples and is truncated to show only the first $10^5$ gradient samples.
  • Figure 2: Comparison of Algorithm 1 and the vanilla stochastic proximal gradient method for the $\ell_{\infty,1}$-regularized machine learning problem on the CIFAR-10 dataset with ResNet-18. The left and right figures correspond to the training loss and test accuracy, respectively.
  • Figure 3: Comparison of Algorithm 1 and the vanilla stochastic proximal gradient method for the statistical preconditioning technique on the CIFAR-10 dataset. The left and right figures correspond to the training loss and test accuracy, respectively.

Theorems & Definitions (33)

  • Proposition 1
  • proof
  • Lemma 1
  • Lemma 2
  • Remark 3
  • proof
  • Lemma 4
  • proof
  • Lemma 5
  • Corollary 5
  • ...and 23 more