Designing Preconditioners for SGD: Local Conditioning, Noise Floors, and Basin Stability
Mitchell Scott, Tianshi Xu, Ziyuan Tang, Alexandra Pichette-Emmons, Qiang Ye, Yousef Saad, Yuanzhe Xi
TL;DR
The paper introduces a geometry-aware analysis of SGD preconditioned by a symmetric positive definite matrix $\mathbf{M}$. It shows that late-stage convergence is governed by an $\mathbf{M}$-based condition number $\hat{L}/\hat{c}$ and a preconditioned noise level $K$: fixed step sizes yield a linear rate down to a noise floor of $\frac{\overline{\alpha}\hat{L}K}{2\hat{c}\mu}$, while diminishing step sizes yield an $\mathcal{O}(1/k)$ rate with a vanishing floor. The analysis extends these global results to a local basin via a local $\mathbf{M}$-PL condition and a basin-stability bound, demonstrating how the preconditioner can simultaneously improve curvature alignment and suppress gradient noise. The framework covers both diagonal/adaptive and curvature-aware preconditioners and gives a simple design principle for improving SGD in practice: choose $\mathbf{M}$ to reduce $\hat{L}/\hat{c}$ and minimize $K$. Experiments on a quadratic diagnostic and three SciML tasks (Noisy Franke surface, PINNs, and Green’s-function learning) show that curvature-aware preconditioners (GGN/Fisher) often yield faster late-stage contraction and lower floors. These results align with the theory and underscore its relevance for physics-informed, stability-critical problems.
Abstract
Stochastic Gradient Descent (SGD) often slows in the late stage of training due to anisotropic curvature and gradient noise. We analyze preconditioned SGD in the geometry induced by a symmetric positive definite matrix $\mathbf{M}$, deriving bounds in which both the convergence rate and the stochastic noise floor are governed by $\mathbf{M}$-dependent quantities: the rate through an effective condition number in the $\mathbf{M}$-metric, and the floor through the product of that condition number and the preconditioned noise level. For nonconvex objectives, we establish a preconditioner-dependent basin-stability guarantee: when smoothness and basin size are measured in the $\mathbf{M}$-norm, the probability that the iterates remain in a well-behaved local region admits an explicit lower bound. This perspective is particularly relevant in Scientific Machine Learning (SciML), where achieving small training loss under stochastic updates is closely tied to physical fidelity, numerical stability, and constraint satisfaction. The framework applies to both diagonal/adaptive and curvature-aware preconditioners and yields a simple design principle: choose $\mathbf{M}$ to improve local conditioning while attenuating noise. Experiments on a quadratic diagnostic and three SciML benchmarks validate the predicted rate-floor behavior.
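The rate-floor behavior described above can be illustrated with a minimal NumPy sketch. It is not the paper's experimental code: it assumes the common preconditioned update $w_{k+1} = w_k - \alpha \mathbf{M}^{-1} g_k$, a synthetic two-dimensional quadratic, additive Gaussian gradient noise, and the idealized choice $\mathbf{M} = \mathbf{A}$ (the exact Hessian). The function name `preconditioned_sgd` and all parameter values are hypothetical choices made for illustration.

```python
import numpy as np

def preconditioned_sgd(A, M_inv, w0, alpha, noise_std, steps, seed=0):
    """Iterate w <- w - alpha * M^{-1} g on f(w) = 0.5 * w^T A w,
    where g = A w + Gaussian noise is a synthetic stochastic gradient.
    Returns the loss after every step."""
    rng = np.random.default_rng(seed)
    w = np.asarray(w0, dtype=float).copy()
    losses = np.empty(steps)
    for k in range(steps):
        # Exact gradient of the quadratic plus additive noise (an assumption).
        g = A @ w + noise_std * rng.standard_normal(w.shape)
        # Preconditioned step with a fixed step size alpha.
        w = w - alpha * (M_inv @ g)
        losses[k] = 0.5 * w @ A @ w
    return losses

# Ill-conditioned toy quadratic; M = A is an idealized preconditioner.
A = np.diag([100.0, 1.0])
M_inv = np.linalg.inv(A)
w0 = np.array([1.0, 1.0])

large_step = preconditioned_sgd(A, M_inv, w0, alpha=0.5, noise_std=0.1, steps=5000)
small_step = preconditioned_sgd(A, M_inv, w0, alpha=0.05, noise_std=0.1, steps=5000)

# Average the tail of each run to estimate the late-stage noise floor.
floor_large = large_step[-1000:].mean()
floor_small = small_step[-1000:].mean()
print(f"late-stage mean loss, alpha=0.50: {floor_large:.2e}")
print(f"late-stage mean loss, alpha=0.05: {floor_small:.2e}")
```

With a fixed step size, both runs contract quickly and then plateau at a nonzero loss; in this toy setting the smaller step settles at the lower plateau, in line with a floor that scales with the step size. Replacing `M_inv` with `np.eye(2)` (and a step size small enough for stability) gives the unpreconditioned baseline for comparison.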
