Non-convex Stochastic Composite Optimization with Polyak Momentum

Yuan Gao; Anton Rodomanov; Sebastian U. Stich

TL;DR

This work studies non-convex stochastic composite optimization for $F(\boldsymbol{x})=f(\boldsymbol{x})+\psi(\boldsymbol{x})$ and shows that the vanilla stochastic proximal gradient (SPG) method cannot converge beyond the gradient-noise floor without large batches. It introduces a Polyak momentum variant of the proximal method, establishes an optimal $O(\varepsilon^{-2})$ convergence rate independent of batch size, and provides a variance-reduction interpretation of momentum in the composite setting. The analysis covers exact and inexact proximal steps, with a Lyapunov framework yielding descent bounds and explicit iteration counts, and is complemented by experiments on a synthetic quadratic problem and on CIFAR-10 with regularization. The results highlight momentum as a robust mechanism for mitigating gradient noise, enabling practical small-batch optimization in non-convex regimes with broader applicability to ML tasks.
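To make the momentum variant concrete, here is a minimal Python sketch of a proximal step driven by a momentum-averaged gradient. It is an illustration under stated assumptions, not the paper's algorithm verbatim: the update form $m_{k+1}=(1-\gamma)m_k+\gamma g_k$, $x_{k+1}=\operatorname{prox}_{\psi/M}(x_k - m_{k+1}/M)$ is one common way to write Polyak momentum, the choice $\psi=\lambda\lVert\cdot\rVert_1$ (with its soft-thresholding prox) is purely illustrative, and the names `prox_momentum_sgd`, `soft_threshold`, and `noisy_grad` are hypothetical.

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1 (illustrative choice of psi)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def prox_momentum_sgd(grad_oracle, x0, M, gamma, lam, num_steps, rng):
    """Proximal stochastic gradient with a momentum-averaged gradient (assumed form).

    grad_oracle(x, rng) returns a stochastic gradient of f at x.
    M is a stepsize coefficient (the step length is 1/M), gamma in (0, 1]
    is the momentum weight, and lam scales the l1 regularizer.
    """
    x = np.array(x0, dtype=float)
    m = grad_oracle(x, rng)  # initialize the momentum buffer with one sample
    for _ in range(num_steps):
        g = grad_oracle(x, rng)
        m = (1.0 - gamma) * m + gamma * g          # exponential average of gradients
        x = soft_threshold(x - m / M, lam / M)     # proximal step on psi / M
    return x

# Toy usage: f(x) = 0.5 * ||x - b||^2 with additive Gaussian gradient noise.
b = np.array([2.0, -1.0, 0.05])

def noisy_grad(x, rng, sigma=1.0):
    return (x - b) + sigma * rng.standard_normal(x.shape)

x_hat = prox_momentum_sgd(noisy_grad, np.zeros(3), M=100.0, gamma=0.1,
                          lam=0.1, num_steps=5000, rng=np.random.default_rng(0))
print(x_hat)
```

With $\gamma=1$ the buffer equals the latest stochastic gradient and the loop reduces to the vanilla stochastic proximal gradient step, which makes the two methods easy to compare in the same code.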

Abstract

The stochastic proximal gradient method is a powerful generalization of the widely used stochastic gradient descent (SGD) method and has found numerous applications in Machine Learning. However, it is well known that this method fails to converge in non-convex settings where the stochastic noise is significant (i.e., when only small or bounded batch sizes are used). In this paper, we focus on the stochastic proximal gradient method with Polyak momentum. We prove that this method attains an optimal convergence rate for non-convex composite optimization problems, regardless of batch size. Additionally, we rigorously analyze the variance reduction effect of Polyak momentum in the composite optimization setting, and we show that the method also converges when the proximal step can only be solved inexactly. Finally, we provide numerical experiments to validate our theoretical results.
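A simplified way to see why averaging gradients can reduce variance is sketched below in Python. This is not the paper's analysis: it freezes the iterate, so each stochastic gradient is the true gradient plus i.i.d. noise of variance $\sigma^2$, and it assumes the exponential-average form $m_k=(1-\gamma)m_{k-1}+\gamma\,\xi_k$ for the noise part of the momentum buffer. Under these assumptions the stationary variance of the buffer is $\gamma\sigma^2/(2-\gamma)$, which is much smaller than $\sigma^2$ for small $\gamma$. The function names are hypothetical.

```python
import numpy as np

def ema_noise_variance(gamma, sigma):
    """Stationary variance of m_k = (1 - gamma) * m_{k-1} + gamma * xi_k
    for i.i.d. zero-mean noise xi_k with variance sigma**2."""
    return gamma * sigma**2 / (2.0 - gamma)

def simulate_ema_variance(gamma, sigma, num_steps, num_runs, seed=0):
    """Monte Carlo estimate of the same quantity over independent runs."""
    rng = np.random.default_rng(seed)
    m = np.zeros(num_runs)  # one momentum buffer per independent run
    for _ in range(num_steps):
        xi = sigma * rng.standard_normal(num_runs)
        m = (1.0 - gamma) * m + gamma * xi
    return m.var()

theory = ema_noise_variance(gamma=0.1, sigma=1.0)
empirical = simulate_ema_variance(gamma=0.1, sigma=1.0, num_steps=200, num_runs=20000)
print(theory, empirical)
```

The frozen-iterate assumption hides the bias that arises because the iterate moves between steps; controlling that trade-off is exactly the kind of question a full convergence analysis has to address.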

Paper Structure (22 sections, 21 theorems, 74 equations, 3 figures, 1 algorithm)

Introduction
Stochastic Proximal Gradient Method
Our Contributions
Related Works
Problem Formulation and Assumptions
Lower Bound for the Vanilla Stochastic Proximal Gradient Method
The Algorithm and Analysis
Convergence Analysis
Initialization and Convergence Guarantees
Variance Reduction Effect of Momentum
Inexact Proximal Step
Experiments
Synthetic Quadratic Problem
Regularized Machine Learning Experiment
Conclusion
...and 7 more sections

Key Result

Proposition 1

For any $K\geq 1$ and any (predefined) stepsize coefficients $\{M_k\}_{k=0}^{K - 1}$ (possibly depending on the problem parameters $L,\sigma^2$ and $K$), there exists an instance of the composite problem $F(\mathbf{x})=f(\mathbf{x})+\psi(\mathbf{x})$ with $f(\mathbf{x})\coloneqq \frac{L}{2}\left\lVert\mathbf{x}\right\rVert^2$ and $\psi(\mathbf{x})

Figures (3)

Figure 1: Comparison of Algorithm 1 and the vanilla stochastic proximal gradient method on the synthetic quadratic problem. For the vanilla stochastic proximal gradient method, we also overlay smoothed curves on the original curves, which oscillate much more. The left, middle, and right panels correspond to $\sigma=5,25,125$, respectively. The vanilla stochastic proximal gradient method uses batch sizes $1, 16, 64$. The x-axis represents the number of gradient samples and is truncated to show only the first $10^5$ gradient samples.
Figure 2: Comparison of Algorithm 1 and the vanilla stochastic proximal gradient method for the $\ell_{\infty, 1}$-regularized machine learning problem on the CIFAR-10 dataset with ResNet-18. The left and right panels show the training loss and test accuracy, respectively.
Figure 3: Comparison of Algorithm 1 and the vanilla stochastic proximal gradient method for the statistical preconditioning technique on the CIFAR-10 dataset. The left and right panels show the training loss and test accuracy, respectively.

Theorems & Definitions (33)

Proposition 1
proof
Lemma 1
Lemma 2
Remark 3
proof
Lemma 4
proof
Lemma 5
Corollary 5
...and 23 more
