Step-Size Stability in Stochastic Optimization: A Theoretical Perspective

Fabian Schaipp, Robert M. Gower, Adrien Taylor

TL;DR

The paper studies how stochastic optimization methods behave as the step size $\alpha$ grows, introducing a stability index $\delta_t$ that quantifies how suboptimality grows with $\alpha$. It develops model-based analyses for SGD, SPS, NGN, and SPP, deriving explicit forms of $\delta_t$ and proving that the adaptive methods yield a $\delta_t$ no larger than that of SGD, often scaling more favorably as $\alpha$ increases. This yields new convergence insights for convex, non-smooth problems and explains the empirically observed robustness of SPS/NGN/SPP to step-size choices that would destabilize SGD. Experimental results on nonconvex deep learning and convex regression show that the theory qualitatively tracks actual performance, supporting the practical relevance of the stability framework. The work suggests that monitoring $\delta_t$ could inform early stopping and motivates extending the approach to momentum-based methods and broader problem classes.

Abstract

We present a theoretical analysis of stochastic optimization methods in terms of their sensitivity with respect to the step size. We identify a key quantity that, for each method, describes how the performance degrades as the step size becomes too large. For convex problems, we show that this quantity directly impacts the suboptimality bound of the method. Most importantly, our analysis provides direct theoretical evidence that adaptive step-size methods, such as SPS or NGN, are more robust than SGD. This allows us to quantify the advantage of these adaptive methods beyond empirical evaluation. Finally, we show through experiments that our theoretical bound qualitatively mirrors the actual performance as a function of the step size, even for nonconvex problems.

Paper Structure (40 sections, 9 theorems, 62 equations, 12 figures, 1 table)

Key Result

Lemma 1

If the model satisfies $f_x(x,s) = f(x,s)$, in particular if (A2) holds, then $\delta_t \geq 0$ holds for any $x_t \in \mathbb{R}^d$, $s_t \in \mathcal{S}$ and any $\alpha_t>0$.

Figures (12)

  • Figure 1: Illustration of theory for $f(x, s_t) = \ln(1+\exp(-x)) + \max\{x-2, 0\}$ and $x_t=-3$. (Left) Next-iterate loss as a function of step size $\alpha$. (Right) Stability index $\delta_t$ as a function of $\alpha$. Vertical line marks best SGD step size. For large $\alpha$, stable loss values coincide with benign ($\approx$ sub-linear) increase of $\delta_t$.
  • Figure 2: (Left) Stability with respect to learning rate $\alpha$ under different scalings of $\Delta_t$. (Right) Values of $\Lambda(\alpha)$ (blue) computed with PEPit, for different Lipschitz constants $G$. We plot the upper bound $f_s(x) - \inf f_s$ for the worst-case instance (green), and the corresponding $\delta^{\texttt{SGD}} = \frac{1}{2} \alpha G^2$ (grey).
  • Figure 3: Illustration of the stability index $\delta_t$ for SGD (left) and NGN (right), for the function from Figure 1. Thin colored lines display the objective of the update step. For growing $\alpha$, this illustrates how $\delta_t$ grows much more slowly for NGN than for SGD.
  • Figure 4: ResNet20 on CIFAR10: Actual training loss (solid lines) and value of the bound $\Omega_T^{\text{last}}$ from Theorem 4 (dashed) with $D=50$. After $T = 20$ epochs, SPS and NGN are more stable than SGD for large $\alpha$ and achieve a smaller loss. This behavior is (qualitatively) reflected in the bound. (Right) Warmup allows SGD to use a larger learning rate. Again, this is reflected in the bound $\Omega_T^{\text{last}}$.
  • Figure 5: (Left) $\delta_t$ explodes for SGD (without warmup) in the first iterations when $\alpha$ is large. (Right) When using warmup over $100$ steps, the increase of $\delta_t$ is slowed down.
  • ...and 7 more figures
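
The left panel of Figure 1 (next-iterate loss as a function of $\alpha$) can be approximated numerically. The following is a minimal sketch, not the authors' code. It uses the function and the point $x_t=-3$ from the caption, and it assumes the commonly used forms of the step sizes: a capped Polyak step for SPS with a hypothetical lower bound of $0$ on the loss, and the NGN step $\alpha / (1 + \tfrac{\alpha}{2 f}\|g\|^2)$. The names `loss`, `subgrad`, `sgd_step`, `sps_step`, and `ngn_step` are illustrative only.

```python
import math


def loss(x):
    # f(x, s_t) = ln(1 + exp(-x)) + max(x - 2, 0), as in the Figure 1 caption.
    return math.log1p(math.exp(-x)) + max(x - 2.0, 0.0)


def subgrad(x):
    # Derivative of ln(1 + exp(-x)) is -1 / (1 + exp(x));
    # for the hinge term we pick the subgradient 1 if x > 2, else 0.
    return -1.0 / (1.0 + math.exp(x)) + (1.0 if x > 2.0 else 0.0)


def sgd_step(x, alpha):
    # Plain (sub)gradient step with constant step size alpha.
    return x - alpha * subgrad(x)


def sps_step(x, alpha, lower_bound=0.0):
    # Assumed form: Polyak step capped at alpha, with a user-supplied
    # lower bound on the loss (0 here, which is an assumption).
    g = subgrad(x)
    if g == 0.0:
        return x
    gamma = min(alpha, (loss(x) - lower_bound) / g**2)
    return x - gamma * g


def ngn_step(x, alpha):
    # Assumed form of the NGN step size for a positive loss.
    g = subgrad(x)
    gamma = alpha / (1.0 + alpha * g**2 / (2.0 * loss(x)))
    return x - gamma * g


if __name__ == "__main__":
    x_t = -3.0
    print(f"current loss: {loss(x_t):.4f}")
    for alpha in (0.1, 1.0, 10.0, 100.0):
        row = [loss(step(x_t, alpha)) for step in (sgd_step, sps_step, ngn_step)]
        print(f"alpha={alpha:6.1f}  SGD={row[0]:9.4f}  SPS={row[1]:7.4f}  NGN={row[2]:7.4f}")
```

Under these assumptions, the SGD iterate overshoots the kink at $x=2$ once $\alpha$ is large and its loss grows roughly linearly in $\alpha$, whereas the effective step sizes of SPS and NGN stay bounded, so their next-iterate loss stays below the current loss.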

Theorems & Definitions (19)

  • Lemma 1
  • Lemma 2
  • proof
  • Theorem 3: Average-iterate bound
  • proof
  • Theorem 4: Last-iterate bound
  • proof
  • Lemma 5
  • Lemma 6
  • Lemma 7
  • ...and 9 more