Adaptive Step Sizes for Preconditioned Stochastic Gradient Descent
Frederik Köhne, Leonie Kreis, Anton Schiela, Roland Herzog
TL;DR
This work addresses the challenge of choosing learning rates in SGD by tying step sizes to computable, locally observable quantities: the gradient Lipschitz constant $L$ and the local variance of the search direction. It proposes a step size rule that combines estimates of nonlinearity and stochasticity, formulated in a Hilbert space with optional preconditioning, and proves convergence guarantees that adapt to both interpolating and non-interpolating regimes. The approach relies on estimators for $L$ and the variance that can be computed during SGD at the cost of one additional forward pass per minibatch, enabling nearly hyperparameter-free optimization. Numerical results on quadratic problems and standard image classification benchmarks demonstrate robust, problem-adaptive behavior across diverse settings, with minimal per-iteration overhead and practical safeguards for nonconvex landscapes.
Abstract
This paper proposes a novel approach to adaptive step sizes in stochastic gradient descent (SGD) by utilizing quantities that we have identified as numerically traceable: the Lipschitz constant of the gradient and a notion of the local variance of the search directions. Our findings yield a nearly hyperparameter-free algorithm for stochastic optimization, which has provable convergence properties and exhibits truly problem-adaptive behavior on classical image classification tasks. Our framework is set in a general Hilbert space and thus allows a preconditioner to be included through the choice of the inner product.
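To make the idea of combining a Lipschitz estimate with a variance estimate concrete, the following is a minimal sketch and not the step size rule or the estimators of this paper. It uses the textbook descent-lemma bound for an $L$-smooth objective with an unbiased search direction $g$ of variance $\sigma^2$, namely $\mathbb{E} f(x - \alpha g) \le f(x) - \alpha \|\nabla f(x)\|^2 + \tfrac{L \alpha^2}{2} (\|\nabla f(x)\|^2 + \sigma^2)$, whose minimizer is $\alpha^* = \|\nabla f(x)\|^2 / (L (\|\nabla f(x)\|^2 + \sigma^2))$. The function names, the secant-based estimate of $L$, the per-sample variance estimate, and the Euclidean inner product (no preconditioner) are all assumptions made for illustration.

```python
import numpy as np

def lipschitz_secant(grad_a, grad_b, x_a, x_b, eps=1e-12):
    """Hypothetical helper: secant estimate of the gradient Lipschitz
    constant, ||grad_a - grad_b|| / ||x_a - x_b|| (Euclidean norm)."""
    return np.linalg.norm(grad_a - grad_b) / max(np.linalg.norm(x_a - x_b), eps)

def direction_variance(per_sample_grads):
    """Hypothetical helper: minibatch-mean gradient and an estimate of its
    variance, from an array of per-sample gradients of shape (b, d), b >= 2."""
    b = per_sample_grads.shape[0]
    mean = per_sample_grads.mean(axis=0)
    # Unbiased sample variance (trace of the covariance) of one sample ...
    sample_var = ((per_sample_grads - mean) ** 2).sum() / (b - 1)
    # ... divided by b to obtain the variance of the minibatch mean.
    return mean, sample_var / b

def step_size(g_mean, var, L):
    """Minimizer of the descent-lemma bound, with ||g_mean||^2 plugged in
    for the unknown ||grad f||^2 (an assumption of this sketch)."""
    g2 = float(g_mean @ g_mean)
    if g2 + var == 0.0:
        return 0.0  # zero direction and zero noise: no step to take
    return g2 / (L * (g2 + var))

# Usage on a toy quadratic f(x) = 0.5 * x^T A x with gradient A x.
A = np.diag([1.0, 4.0])
x_old, x_new = np.array([0.0, 0.0]), np.array([0.0, 1.0])
L_est = lipschitz_secant(A @ x_new, A @ x_old, x_new, x_old)
g_mean, var = direction_variance(np.array([[1.0, 0.0], [3.0, 0.0]]))
alpha = step_size(g_mean, var, L_est)
```

In this sketch the step size reduces to $1/L$ when the variance estimate is zero and shrinks as the noise starts to dominate the squared gradient norm, which is the qualitative behavior one would expect from a rule that adapts to both interpolating and non-interpolating regimes.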
