Safeguarded Stochastic Polyak Step Sizes for Non-smooth Optimization: Robust Performance Without Small (Sub)Gradients

Dimitris Oikonomou, Nicolas Loizou

TL;DR

The paper addresses non-smooth stochastic optimization by introducing Safeguarded Stochastic Polyak Step Sizes (SPS_safe) for the Stochastic Subgradient Method and its momentum variant (IMA_SPS_safe). These adaptive rules use a safeguard M in the denominator to prevent instability from vanishing gradients and eliminate the need for interpolation or exact f_i^* values, while preserving an O(1/√T) convergence rate to a neighborhood of the optimum. The authors extend the analysis to momentum, prove convergence for both Cesàro averages and last iterates, and provide extensive numerical validation on convex problems and deep neural networks, showing faster convergence, reduced variance, and robustness to vanishing gradients. The work also connects safeguarded steps to adaptive gradient clipping and demonstrates practical benefits in DNN training, offering a parameter-free, theoretically grounded alternative to traditional Polyak-type updates in non-smooth settings.

Abstract

The stochastic Polyak step size (SPS) has proven to be a promising choice for stochastic gradient descent (SGD), delivering competitive performance relative to state-of-the-art methods on smooth convex and non-convex optimization problems, including deep neural network training. However, extensions of this approach to non-smooth settings remain in their early stages, often relying on interpolation assumptions or requiring knowledge of the optimal solution. In this work, we propose a novel SPS variant, Safeguarded SPS (SPS$_{safe}$), for the stochastic subgradient method, and provide rigorous convergence guarantees for non-smooth convex optimization with no need for strong assumptions. We further incorporate momentum into the update rule, yielding equally tight theoretical results. Comprehensive experiments on convex benchmarks and deep neural networks corroborate our theory: the proposed step size accelerates convergence, reduces variance, and consistently outperforms existing adaptive baselines. Finally, in the context of deep neural network training, our method demonstrates robust performance by addressing the vanishing gradient problem.

Paper Structure

This paper contains 50 sections, 12 theorems, 47 equations, 15 figures, 4 tables.

Key Result

proposition 1

eq:ssm with eq:sgd-sps-M and $M=c^2$ is algebraically equivalent to eq:clipped-ssm with the adaptive step size $\tilde{\gamma}_t=\frac{f_i(x^t)-\ell_i^*}{c\max\{c,\|g_i^t\|\}}$.
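The equivalence in Proposition 1 can be checked numerically. The sketch below is illustrative only and rests on two assumptions that are not spelled out in this summary: that eq:sgd-sps-M is the step size $\gamma_t=\frac{f_i(x^t)-\ell_i^*}{\max\{\|g_i^t\|^2, M\}}$, and that eq:clipped-ssm is the update $x^{t+1}=x^t-\tilde{\gamma}_t\min\{1, c/\|g_i^t\|\}\,g_i^t$. The function names `sps_safe_step` and `clipped_step` are hypothetical.

```python
import math


def sps_safe_step(x, g, f_val, lower_bound, M):
    """One subgradient step with the ASSUMED safeguarded Polyak step size
    gamma = (f_i(x) - l_i^*) / max(||g||^2, M)."""
    g_norm_sq = sum(gi * gi for gi in g)
    gamma = (f_val - lower_bound) / max(g_norm_sq, M)
    return [xi - gamma * gi for xi, gi in zip(x, g)]


def clipped_step(x, g, f_val, lower_bound, c):
    """One ASSUMED clipped subgradient step, using the adaptive step size
    gamma_tilde = (f_i(x) - l_i^*) / (c * max(c, ||g||)) from Proposition 1."""
    g_norm = math.sqrt(sum(gi * gi for gi in g))
    # Clipping factor min{1, c / ||g||}; equals 1 when ||g|| <= c (including g = 0).
    scale = 1.0 if g_norm <= c else c / g_norm
    gamma_tilde = (f_val - lower_bound) / (c * max(c, g_norm))
    return [xi - gamma_tilde * scale * gi for xi, gi in zip(x, g)]


if __name__ == "__main__":
    c = 1.0
    M = c ** 2  # the choice M = c^2 from Proposition 1
    x = [3.0, -4.0]
    # One subgradient below the threshold c and one above it (made-up values).
    for g in ([0.3, -0.4], [3.0, -4.0]):
        a = sps_safe_step(x, g, f_val=2.0, lower_bound=0.0, M=M)
        b = clipped_step(x, g, f_val=2.0, lower_bound=0.0, c=c)
        print(g, a, b)
```

Under these assumptions the two cases match term by term: when $\|g_i^t\|\le c$ both updates use the step $(f_i(x^t)-\ell_i^*)/c^2$, and when $\|g_i^t\|>c$ the factor $c/\|g_i^t\|$ from clipping combines with $\tilde{\gamma}_t$ to give $(f_i(x^t)-\ell_i^*)/\|g_i^t\|^2$.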

Figures (15)

  • Figure 1: Comparison of eq:spsmax-smooth and eq:sgd-sps-M in the training of ResNet20 in CIFAR-10 (left plot) and CIFAR-100 (right plot).
  • Figure 2: Sensitivity analysis of the safeguarded Polyak step size to the threshold $M$ (Panels subfig:pr_sgd_sens–subfig:pr_ima_sens for Phase Retrieval) and comparison against SPS variants (Panels subfig:svm_sgd_comp–subfig:svm_ima_comp for SVM).
  • Figure 3: Test accuracy of ResNet20 on CIFAR-10. Left: eq:ssm-based methods. Right: eq:ima-based methods.
  • Figure 4: Gradient norms during training of ResNet20. Left: Trained on CIFAR-10. Right: Trained on CIFAR-100.
  • Figure 5: Test accuracy of ResNet20 on CIFAR-100. Left: eq:ssm-based methods. Right: eq:ima-based methods.
  • ...and 10 more figures

Theorems & Definitions (16)

  • proposition 1
  • theorem 2
  • corollary 3: Interpolation
  • corollary 4: Deterministic SSM
  • theorem 5
  • theorem 6
  • Proposition
  • proof
  • Theorem
  • proof: Proof of thm:sgd-M-lip
  • ...and 6 more