Step-Size Stability in Stochastic Optimization: A Theoretical Perspective

Fabian Schaipp; Robert M. Gower; Adrien Taylor

TL;DR

The paper addresses how stochastic optimization methods behave as the step size grows, introducing a stability index $\delta_t$ that quantifies suboptimality growth with $\alpha$. It develops model-based analyses for SGD, SPS, NGN, and SPP, deriving explicit forms of $\delta_t$ and proving that adaptive methods yield $\delta_t$ no larger than SGD, often scaling more favorably as $\alpha$ increases. This yields new convex/non-smooth convergence insights and explains empirically observed robustness of SPS/NGN/SPP beyond traditional SGD tuning. Experimental results on nonconvex deep learning and convex regression show the theory qualitatively tracks actual performance, validating the practical relevance of the stability framework. The work suggests that monitoring $\delta_t$ could inform early stopping and motivates extending the approach to momentum-based methods and broader problem classes.

Abstract

We present a theoretical analysis of stochastic optimization methods in terms of their sensitivity with respect to the step size. We identify a key quantity that, for each method, describes how the performance degrades as the step size becomes too large. For convex problems, we show that this quantity directly impacts the suboptimality bound of the method. Most importantly, our analysis provides direct theoretical evidence that adaptive step-size methods, such as SPS or NGN, are more robust than SGD. This allows us to quantify the advantage of these adaptive methods beyond empirical evaluation. Finally, we show through experiments that our theoretical bound qualitatively mirrors the actual performance as a function of the step size, even for nonconvex problems.

Paper Structure (40 sections, 9 theorems, 62 equations, 12 figures, 1 table)

Introduction
Summary and contributions.
Limitations.
Setup and background.
Notation.
Related Work
Model-based stochastic optimization.
The issue of learning-rate tuning.
Stochastic proximal point.
Theoretical Analysis
Illustration of Theorem 4.
Model Choices
Linear Model (SGD)
Truncated Model (SPS)
Square-root model (NGN)
...and 25 more sections

Key Result

Lemma 1

If the model satisfies $f_x(x,s) = f(x,s)$, in particular if (A2) holds, then $\delta_t \geq 0$ holds for any $x_t \in \mathbb{R}^d$, $s_t \in \mathcal{S}$ and any $\alpha_t>0$.

Figures (12)

Figure 1: Illustration of theory for $f(x, s_t) = \ln(1+\exp(-x)) + \max\{x-2, 0\}$ and $x_t=-3$. (Left) Next-iterate loss as a function of step size $\alpha$. (Right) Stability index $\delta_t$ as a function of $\alpha$. Vertical line marks best SGD step size. For large $\alpha$, stable loss values coincide with benign ($\approx$ sub-linear) increase of $\delta_t$.
Figure 2: (Left) Stability with respect to learning rate $\alpha$ under different scalings of $\Delta_t$. (Right) Values of $\Lambda(\alpha)$ (blue) computed with PEPit, for different Lipschitz constants $G$. We plot the upper bound $f_s(x) - \inf f_s$ for the worst-case instance (green), and the corresponding $\delta^{\texttt{SGD}} = \frac{1}{2} \alpha G^2$ (grey).
Figure 3: Illustration of the stability index $\delta_t$ for SGD (left) and NGN (right), for the function from Figure 1. Thin colored lines display the objective of the update step. For growing $\alpha$, this illustrates how $\delta_t$ grows much more slowly for NGN than for SGD.
Figure 4: ResNet20 on CIFAR10: Actual training loss (solid lines) and value of the bound $\Omega_T^{\text{last}}$ from Theorem 4 (dashed) with $D=50$. After $T = 20$ epochs, SPS and NGN are more stable than SGD for large $\alpha$ and achieve a smaller loss. This behavior is (qualitatively) reflected in the bound. (Right) Warmup allows SGD to use a larger learning rate. Again, this is reflected in the bound $\Omega_T^{\text{last}}$.
Figure 5: (Left) $\delta_t$ explodes for SGD (without warmup) in the first iterations when $\alpha$ is large. (Right) When using warmup over $100$ steps, the increase of $\delta_t$ is slowed down.
...and 7 more figures
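
The left panel of Figure 1 (next-iterate loss as a function of the step size) can be approximated with a few lines of Python. This is only a minimal sketch under stated assumptions, not the authors' code: the function and the starting point $x_t=-3$ come from the Figure 1 caption, while the step rules are the commonly cited forms of SGD, capped SPS, and NGN, which this page does not spell out. The SPS rule here uses $0$ as a lower bound on the loss instead of the true infimum, and the helper names `loss`, `subgrad`, and `next_loss` are hypothetical.

```python
import math


def loss(x):
    # f(x, s_t) = ln(1 + exp(-x)) + max(x - 2, 0), taken from the Figure 1 caption
    return math.log1p(math.exp(-x)) + max(x - 2.0, 0.0)


def subgrad(x):
    # One valid subgradient; at the kink x = 2 the zero branch of the max term is used
    return -1.0 / (1.0 + math.exp(x)) + (1.0 if x > 2.0 else 0.0)


def next_loss(method, alpha, x=-3.0):
    """Loss after one step from x with step-size parameter alpha."""
    f, g = loss(x), subgrad(x)
    if method == "sgd":
        step = alpha  # plain (sub)gradient step
    elif method == "sps":
        # ASSUMPTION: capped Polyak step with lower bound 0 on the loss
        step = min(alpha, f / g**2)
    elif method == "ngn":
        # ASSUMPTION: NGN step alpha / (1 + alpha * g^2 / (2 f)), valid for f > 0
        step = alpha / (1.0 + alpha * g**2 / (2.0 * f))
    else:
        raise ValueError(f"unknown method: {method}")
    return loss(x - step * g)


for alpha in (0.1, 1.0, 10.0, 100.0):
    row = {m: round(next_loss(m, alpha), 3) for m in ("sgd", "sps", "ngn")}
    print(alpha, row)
```

Under these assumptions the SPS and NGN steps stay bounded as $\alpha$ grows, so their next-iterate loss levels off, whereas the SGD iterate overshoots the minimizer and its loss keeps increasing with $\alpha$.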

Theorems & Definitions (19)

Lemma 1
Lemma 2
proof
Theorem 3: Average-iterate bound
proof
Theorem 4: Last-iterate bound
proof
Lemma 5
Lemma 6
Lemma 7
...and 9 more
