
Adaptive Step Sizes for Preconditioned Stochastic Gradient Descent

Frederik Köhne, Leonie Kreis, Anton Schiela, Roland Herzog

TL;DR

This work addresses the challenge of choosing learning rates in SGD by tying step sizes to computable, locally observable quantities: the gradient Lipschitz constant $L$ and the local variance of the search direction. It proposes a step size rule that combines estimates of nonlinearity and stochasticity, formulated in a Hilbert space with optional preconditioning, and proves convergence guarantees that adapt to both interpolating and non-interpolating regimes. The approach relies on estimators for $L$ and variance that are obtainable during SGD with an additional forward pass per minibatch, enabling near hyperparameter-free optimization. Numerical results on quadratic problems and standard image classification benchmarks demonstrate robust, problem-adaptive behavior across diverse settings, with minimal per-iteration overhead and practical safeguards for nonconvex landscapes.

Abstract

This paper proposes a novel approach to adaptive step sizes in stochastic gradient descent (SGD) by utilizing quantities that we have identified as numerically traceable: the Lipschitz constant for gradients and a concept of the local variance in search directions. Our findings yield a nearly hyperparameter-free algorithm for stochastic optimization, which has provable convergence properties and exhibits truly problem-adaptive behavior on classical image classification tasks. Our framework is set in a general Hilbert space and thus enables the potential inclusion of a preconditioner through the choice of the inner product.
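
To make the role of the two quantities concrete, here is a minimal sketch, not the authors' algorithm. It runs SGD on a toy diagonal quadratic with the step size $\frac{1}{L \, (1 + V)}$, where $L$ is the gradient Lipschitz constant and $V$ is a relative variance bound assumed to satisfy $E\|g - \nabla f\|^2 \le V \, \|\nabla f\|^2$. Both values are supplied by hand here; the paper estimates them during the SGD run, which this sketch does not reproduce. The function names and the multiplicative noise model are hypothetical choices made for illustration.

```python
import random


def adaptive_step_size(lipschitz, rel_variance):
    """Step size 1 / (L * (1 + V)) from a smoothness constant L and a
    relative variance bound V (both assumed known in this sketch)."""
    return 1.0 / (lipschitz * (1.0 + rel_variance))


def objective(hessian_diag, x):
    """Toy objective f(x) = 0.5 * sum_i h_i * x_i^2 with diagonal Hessian."""
    return 0.5 * sum(h * xi * xi for h, xi in zip(hessian_diag, x))


def run_sgd(hessian_diag, x0, rel_variance, num_steps, seed=0):
    """Run SGD with the fixed step size 1 / (L * (1 + V)).

    Assumed toy noise model: the stochastic gradient is the true gradient
    scaled by (1 + s * sqrt(V)) with s = +1 or -1 at random, so it is
    unbiased and its variance equals V * ||grad f(x)||^2.
    """
    rng = random.Random(seed)
    lipschitz = max(hessian_diag)  # exact L for a diagonal quadratic
    alpha = adaptive_step_size(lipschitz, rel_variance)
    x = list(x0)
    for _ in range(num_steps):
        grad = [h * xi for h, xi in zip(hessian_diag, x)]
        scale = 1.0 + rng.choice((-1.0, 1.0)) * rel_variance ** 0.5
        x = [xi - alpha * scale * gi for xi, gi in zip(x, grad)]
    return x


if __name__ == "__main__":
    h = [4.0, 1.0, 0.25]
    x_final = run_sgd(h, [1.0, 1.0, 1.0], rel_variance=0.25, num_steps=500)
    print("final objective:", objective(h, x_final))
```

Under this noise model the gradient noise vanishes at the minimizer, so the example corresponds to the interpolating regime. A larger $V$ shrinks the step size, which is the trade-off between nonlinearity and stochasticity that the step size rule encodes.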

Paper Structure

This paper contains 37 sections, 13 theorems, 71 equations, 6 figures, 1 table, 5 algorithms.

Key Result

lemma 1

Let $(f_\xi, \Omega, P)$ be a $(\mu, L)$-feasible SOP such that $f_\xi$ is $L_\xi$-smooth for some measurable function $\xi \mapsto L_\xi$. Then the variance assumption eq:var_assumption holds with

Figures (6)

  • Figure 2.1: A step size $\sim \mu$ is too conservative. The figure compares different step sizes as a function of the convexity parameter $\mu$ for the example in the proof of proposition:unbounded-variation-bound-Pstar. It plots the relative progress of SGD; higher values indicate better performance. According to the theory presented in section:problem-setting, a step size of $\frac{1}{L \, (1 + V_1)}$ should be employed. As shown in the proof of proposition:unbounded-variation-bound-Pstar, $V_1$ grows at a rate of $\frac{1}{\mu}$ in this example, so keeping $L = 1$ fixed results in a step size of $\sim \mu$. The plot indicates that this choice is too conservative.
  • Figure 6.1: Non-interpolating case: performance of adaptive step size control for the first scenario ($\mu = 1$ and $L$ variable).
  • Figure 6.2: Non-interpolating case: performance of adaptive step size control for the second scenario ($L = 1$ and $\mu$ variable).
  • Figure 6.3: Interpolating case: performance of adaptive step size control for the first scenario ($\mu = 1$ and $L$ variable).
  • Figure 6.4: Interpolating case: performance of adaptive step size control for the second scenario ($L = 1$ and $\mu$ variable).
  • ...and 1 more figure
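
The scaling argument in the caption of Figure 2.1 can be written out as a short calculation. The constant $c > 0$ is an assumption introduced here for illustration; the caption only states that $V_1$ grows at a rate of $\frac{1}{\mu}$. Taking $V_1 = \frac{c}{\mu}$ and $L = 1$ gives

$$
\frac{1}{L \, (1 + V_1)} = \frac{1}{1 + \frac{c}{\mu}} = \frac{\mu}{\mu + c} \approx \frac{\mu}{c} \quad \text{for } \mu \ll c,
$$

so the step size shrinks linearly with $\mu$, which is the $\sim \mu$ behavior that the figure reports as too conservative.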

Theorems & Definitions (34)

  • definition 1
  • definition 2
  • definition 3
  • remark 1
  • lemma 1
  • proof
  • lemma 2
  • proof
  • definition 4
  • proposition 1
  • ...and 24 more