Non-convex Stochastic Composite Optimization with Polyak Momentum
Yuan Gao, Anton Rodomanov, Sebastian U. Stich
TL;DR
This work studies non-convex stochastic composite optimization for $F(\boldsymbol{x})=f(\boldsymbol{x})+\boldsymbol{\psi}(\boldsymbol{x})$ and shows that vanilla stochastic proximal gradient (SPG) cannot converge beyond the gradient-noise floor without large batches. It introduces a Polyak momentum variant of the proximal method, establishes an optimal $O(\varepsilon^{-2})$ convergence rate independent of batch size, and provides a variance-reduction interpretation of momentum in the composite setting. The analysis includes exact and inexact proximal steps, with a Lyapunov framework yielding descent bounds and explicit iteration counts, complemented by experiments on a synthetic quadratic problem and CIFAR-10 with regularization. The results highlight momentum as a robust mechanism to mitigate gradient noise, enabling practical small-batch optimization in non-convex regimes with broader applicability to ML tasks.
Abstract
The stochastic proximal gradient method is a powerful generalization of the widely used stochastic gradient descent (SGD) method and has found numerous applications in machine learning. However, it is well known that this method fails to converge in non-convex settings where the stochastic noise is significant (i.e., when only small or bounded batch sizes are used). In this paper, we focus on the stochastic proximal gradient method with Polyak momentum. We prove that this method attains an optimal convergence rate for non-convex composite optimization problems, regardless of batch size. Additionally, we rigorously analyze the variance reduction effect of Polyak momentum in the composite optimization setting, and we show that the method also converges when the proximal step can only be solved inexactly. Finally, we provide numerical experiments to validate our theoretical results.
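To make the idea concrete, the following is a minimal sketch of a proximal gradient step driven by a Polyak-style moving average of stochastic gradients. It is an illustration under stated assumptions, not the authors' exact algorithm: the regularizer is assumed to be $\psi(\boldsymbol{x})=\lambda\|\boldsymbol{x}\|_1$ (so the proximal step is soft-thresholding), the step size and momentum weight are constants, and the names `soft_threshold`, `prox_sgd_momentum`, and `make_noisy_quadratic_grad` are hypothetical helpers introduced only for this example.

```python
import random


def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1, applied coordinate-wise."""
    return [max(abs(vi) - tau, 0.0) * (1.0 if vi > 0 else -1.0) for vi in v]


def prox_sgd_momentum(stoch_grad, x0, step, gamma, lam, iters, rng):
    """Proximal step on a moving average m of stochastic gradients.

    Assumed update (illustrative):
        m <- (1 - gamma) * m + gamma * g,   g = stochastic gradient at x
        x <- prox_{step * lam * ||.||_1}(x - step * m)
    """
    x = list(x0)
    m = None
    for _ in range(iters):
        g = stoch_grad(x, rng)
        if m is None:
            m = list(g)  # initialize the average with the first gradient
        else:
            m = [(1.0 - gamma) * mi + gamma * gi for mi, gi in zip(m, g)]
        shifted = [xi - step * mi for xi, mi in zip(x, m)]
        x = soft_threshold(shifted, step * lam)
    return x


def make_noisy_quadratic_grad(b, noise_std):
    """Stochastic gradient of f(x) = 0.5 * ||x - b||^2 with Gaussian noise."""
    def grad(x, rng):
        return [xi - bi + rng.gauss(0.0, noise_std) for xi, bi in zip(x, b)]
    return grad


if __name__ == "__main__":
    rng = random.Random(0)
    grad = make_noisy_quadratic_grad([3.0, -2.0, 0.1], noise_std=1.0)
    x = prox_sgd_momentum(grad, [0.0, 0.0, 0.0], step=0.1, gamma=0.1,
                          lam=0.5, iters=2000, rng=rng)
    print(x)
```

Setting `gamma=1.0` recovers the plain stochastic proximal gradient step, since the average then equals the latest stochastic gradient. Smaller values of `gamma` average over more past gradients, which is the mechanism the abstract describes as a variance reduction effect. The toy objective here is convex and is used only because its minimizer is known in closed form.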
