Adaptive Step Sizes for Preconditioned Stochastic Gradient Descent

Frederik Köhne; Leonie Kreis; Anton Schiela; Roland Herzog

TL;DR

This work addresses the challenge of choosing learning rates in SGD by tying step sizes to computable, locally observable quantities: the gradient Lipschitz constant $L$ and the local variance of the search direction. It proposes a step size rule that combines estimates of nonlinearity and stochasticity, formulated in a Hilbert space with optional preconditioning, and proves convergence guarantees that adapt to both interpolating and non-interpolating regimes. The approach relies on estimators for $L$ and variance that are obtainable during SGD with an additional forward pass per minibatch, enabling near hyperparameter-free optimization. Numerical results on quadratic problems and standard image classification benchmarks demonstrate robust, problem-adaptive behavior across diverse settings, with minimal per-iteration overhead and practical safeguards for nonconvex landscapes.

Abstract

This paper proposes a novel approach to adaptive step sizes in stochastic gradient descent (SGD) by utilizing quantities that we have identified as numerically traceable -- the Lipschitz constant for gradients and a concept of the local variance in search directions. Our findings yield a nearly hyperparameter-free algorithm for stochastic optimization, which has provable convergence properties and exhibits truly problem adaptive behavior on classical image classification tasks. Our framework is set in a general Hilbert space and thus enables the potential inclusion of a preconditioner through the choice of the inner product.
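The idea of driving the step size from an estimated gradient Lipschitz constant and a variance estimate can be illustrated with a small sketch. This is not the paper's algorithm: the estimator below is an assumption based on the standard descent lemma, $f(x - \tau g) \le f(x) - \tau \|g\|^2 + \tfrac{L \tau^2}{2} \|g\|^2$, solved for $L$ using one extra loss evaluation. The step size formula $\frac{1}{L \, (1 + V)}$ mirrors the one quoted in the caption of Figure 2.1. The function names, the diagonal quadratic test problem, and the variance input `V_est` are hypothetical and chosen only for illustration.

```python
def quad_loss(x, a, b):
    """Toy minibatch loss f(x) = 0.5 * sum_i a_i * (x_i - b_i)^2 (illustrative only)."""
    return 0.5 * sum(ai * (xi - bi) ** 2 for xi, ai, bi in zip(x, a, b))


def quad_grad(x, a, b):
    """Gradient of the toy loss above with respect to x."""
    return [ai * (xi - bi) for xi, ai, bi in zip(x, a, b)]


def estimate_L(loss, x, g, tau):
    """Estimate a local gradient Lipschitz constant from one extra loss evaluation.

    Assumption (not taken from the paper): rearrange the descent lemma
        f(x - tau*g) <= f(x) - tau*||g||^2 + (L * tau^2 / 2) * ||g||^2
    into an equality and solve for L along the direction g.
    """
    gg = sum(gi * gi for gi in g)
    if gg == 0.0:
        return 0.0  # zero gradient: no curvature information along g
    trial = [xi - tau * gi for xi, gi in zip(x, g)]
    return 2.0 * (loss(trial) - loss(x) + tau * gg) / (tau ** 2 * gg)


def step_size(L_est, V_est):
    """Step size 1 / (L * (1 + V)) from a curvature and a relative-variance estimate."""
    if L_est <= 0.0:
        # Nonpositive curvature estimate (possible for nonconvex losses):
        # this sketch refuses to guess a fallback.
        raise ValueError("L_est must be positive")
    return 1.0 / (L_est * (1.0 + V_est))


# Usage on a 2D quadratic with curvatures 1 and 4.
a, b = [1.0, 4.0], [0.0, 0.0]
x = [1.0, 1.0]
g = quad_grad(x, a, b)
L_est = estimate_L(lambda z: quad_loss(z, a, b), x, g, tau=0.1)
tau_next = step_size(L_est, V_est=0.0)
```

For a quadratic, this estimate equals the Rayleigh quotient $g^\top A g / \|g\|^2$, so it never exceeds the global constant $L$ and adapts to the curvature along the current search direction.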

Paper Structure (37 sections, 13 theorems, 71 equations, 6 figures, 1 table, 5 algorithms)


Introduction
Known Adaptive Step Size Strategies
Polyak-Type Strategies
Line Search Strategies
Diagonal Scaling Methods
Trust Region Methods
Variance in the Search Direction
Noise at the Minimizer
Our Contribution
Outline
Problem Setting
SGD Descent Analysis
Problems Arising
Asymptotic Behavior of the Variance
Variance Bounds Independent of Convexity
...and 22 more sections

Key Result

lemma 1

Let $(f_\xi, \Omega, P)$ be a $(\mu, L)$-feasible SOP such that $f_\xi$ is $L_\xi$-smooth for some measurable function $\xi \mapsto L_\xi$. Then the variance assumption eq:var_assumption holds with

Figures (6)

Figure 2.1: A step size $\sim \mu$ is too conservative. The figure compares different step sizes as a function of the convexity parameter $\mu$ for the example in the proof of proposition:unbounded-variation-bound-Pstar. It plots the relative progress of SGD, with higher values indicating better performance. According to the theory presented in section:problem-setting, a step size of $\frac{1}{L \, (1 + V_1)}$ should be employed. As shown in the proof of proposition:unbounded-variation-bound-Pstar, $V_1$ grows at a rate of $\frac{1}{\mu}$ in this example. Therefore, keeping $L = 1$ fixed results in a step size of $\sim \mu$, which appears to be too conservative.
Figure 6.1: Non-interpolating case: performance of adaptive step size control for the first scenario ($\mu = 1$ and $L$ variable).
Figure 6.2: Non-interpolating case: performance of adaptive step size control for the second scenario ($L = 1$ and $\mu$ variable).
Figure 6.3: Interpolating case: performance of adaptive step size control for the first scenario ($\mu = 1$ and $L$ variable).
Figure 6.4: Interpolating case: performance of adaptive step size control for the second scenario ($L = 1$ and $\mu$ variable).
...and 1 more figure
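
The scaling stated in the caption of Figure 2.1 follows from a one-line computation. As a sketch, assume $V_1 = c / \mu$ for some constant $c > 0$ and fix $L = 1$. Both the constant $c$ and the symbol $\tau$ for the step size are placeholders introduced here, not notation confirmed by the source:

$$
\tau = \frac{1}{L \, (1 + V_1)} = \frac{1}{1 + c / \mu} = \frac{\mu}{\mu + c} \approx \frac{\mu}{c} \quad \text{for } \mu \ll c.
$$

Under this assumption, the prescribed step size shrinks linearly in $\mu$ as the problem becomes less strongly convex, which is the behavior the caption describes as too conservative.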

Theorems & Definitions (34)

definition 1
definition 2
definition 3
remark 1
lemma 1
proof
lemma 2
proof
definition 4
proposition 1
...and 24 more
