Designing Preconditioners for SGD: Local Conditioning, Noise Floors, and Basin Stability

Mitchell Scott, Tianshi Xu, Ziyuan Tang, Alexandra Pichette-Emmons, Qiang Ye, Yousef Saad, Yuanzhe Xi

TL;DR

The paper introduces a geometry-aware analysis of SGD preconditioned by a symmetric positive definite matrix $\mathbf{M}$, showing that late-stage convergence is governed by an $\mathbf{M}$-based condition number $\hat{L}/\hat{c}$ and a preconditioned noise level $K$, yielding a linear rate with a floor of $\frac{\overline{\alpha}\hat{L}K}{2\hat{c}\mu}$ for fixed steps and an $\mathcal{O}(1/k)$ rate with a vanishing floor for diminishing steps. It extends global results to a local basin via a local $\mathbf{M}$-PL condition and a basin-stability bound, demonstrating how the preconditioner can simultaneously improve curvature alignment and suppress gradient noise. The framework covers both diagonal/adaptive and curvature-aware preconditioners, and a simple design principle—choose $\mathbf{M}$ to reduce $\hat{L}/\hat{c}$ and minimize $K$—guides practical SGD improvements. Empirical validation on a quadratic diagnostic and three SciML tasks (Noisy Franke surface, PINNs, and Green’s-function learning) shows that curvature-aware preconditioners (GGN/Fisher) often yield faster late-stage contraction and lower floors, aligning with the theory and underscoring the practical relevance for physics-informed, stability-critical problems.

Abstract

Stochastic Gradient Descent (SGD) often slows in the late stage of training due to anisotropic curvature and gradient noise. We analyze preconditioned SGD in the geometry induced by a symmetric positive definite matrix $\mathbf{M}$, deriving bounds in which both the convergence rate and the stochastic noise floor are governed by $\mathbf{M}$-dependent quantities: the rate through an effective condition number in the $\mathbf{M}$-metric, and the floor through the product of that condition number and the preconditioned noise level. For nonconvex objectives, we establish a preconditioner-dependent basin-stability guarantee: when smoothness and basin size are measured in the $\mathbf{M}$-norm, the probability that the iterates remain in a well-behaved local region admits an explicit lower bound. This perspective is particularly relevant in Scientific Machine Learning (SciML), where achieving small training loss under stochastic updates is closely tied to physical fidelity, numerical stability, and constraint satisfaction. The framework applies to both diagonal/adaptive and curvature-aware preconditioners and yields a simple design principle: choose $\mathbf{M}$ to improve local conditioning while attenuating noise. Experiments on a quadratic diagnostic and three SciML benchmarks validate the predicted rate-floor behavior.
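The rate-floor behavior described above can be illustrated on a toy quadratic. The following is a minimal sketch, not the paper's experimental setup: the objective $F(\mathbf{w})=\tfrac12\mathbf{w}^\top\mathbf{A}\mathbf{w}$, the additive Gaussian gradient noise, the matrix `A`, the step size, and the helper name `psgd_quadratic` are all assumptions chosen for illustration. It runs the update $\mathbf{w}\leftarrow\mathbf{w}-\alpha\,\mathbf{M}^{-1}\mathbf{g}$ with a fixed step and compares the late-stage loss plateau for $\mathbf{M}=\mathbf{I}$ against the idealized curvature-aware choice $\mathbf{M}=\mathbf{A}$.

```python
import numpy as np

def psgd_quadratic(A, M_inv, alpha, steps, sigma=1.0, seed=0):
    """Fixed-step preconditioned SGD on F(w) = 0.5 * w^T A w.

    Update: w <- w - alpha * M_inv @ (A w + noise), where the noise is
    isotropic Gaussian with standard deviation sigma (an assumption).
    Returns the loss after every step.
    """
    rng = np.random.default_rng(seed)
    w = np.ones(A.shape[0])
    losses = np.empty(steps)
    for k in range(steps):
        grad = A @ w + sigma * rng.standard_normal(w.shape)  # noisy gradient
        w = w - alpha * (M_inv @ grad)                       # preconditioned step
        losses[k] = 0.5 * w @ A @ w
    return losses

# Hypothetical anisotropic curvature: one flat and one stiff direction.
A = np.diag([1.0, 100.0])
plain = psgd_quadratic(A, np.eye(2), alpha=0.01, steps=20000)
precond = psgd_quadratic(A, np.linalg.inv(A), alpha=0.01, steps=20000)

# Average the second half of each run to estimate the plateau (noise floor).
print("plain SGD plateau:         ", plain[-10000:].mean())
print("preconditioned SGD plateau:", precond[-10000:].mean())
```

In this toy setting, with the same step size, $\mathbf{M}=\mathbf{A}$ scales down the noise injected along the stiff direction, so the preconditioned run should settle at a lower plateau. A larger step would speed up contraction but raise the plateau again, which is the trade-off the fixed-step bound expresses.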

Paper Structure

This paper contains 8 sections, 13 theorems, 100 equations, 11 figures, 1 table.

Key Result

Lemma 1

Let $F$ be twice differentiable and $\mathbf{M}^{-1}=\mathbf{P}\mathbf{P}^\top$. Then: (i) $\nabla F$ is $\mathbf{M}$-Lipschitz with constant $\hat{L}$ $\iff$ all eigenvalues of $\mathbf{P}^\top\nabla^2F(\mathbf{w})\mathbf{P}$ are $\le \hat{L}$; (ii) $F$ is $\mathbf{M}$-strongly convex with constant
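Part (i) can be checked numerically for a quadratic, where the Hessian is a constant matrix $\mathbf{H}$ and the smallest admissible $\hat{L}$ is the largest eigenvalue of $\mathbf{P}^\top\mathbf{H}\mathbf{P}$. The sketch below is illustrative only: the test Hessian, the Jacobi choice $\mathbf{M}=\mathrm{diag}(\mathbf{H})$, and the helper name `preconditioned_spectrum` are assumptions, and $\mathbf{P}$ is taken to be the Cholesky factor of $\mathbf{M}^{-1}$, which is one of many valid factorizations. The printed max/min eigenvalue ratio is included for orientation; reading it as $\hat{L}/\hat{c}$ is also an assumption, since part (ii) is cut off in this excerpt.

```python
import numpy as np

def preconditioned_spectrum(H, M):
    """Eigenvalues (ascending) of P^T H P, where M^{-1} = P P^T."""
    P = np.linalg.cholesky(np.linalg.inv(M))  # lower-triangular, P @ P.T == inv(M)
    return np.linalg.eigvalsh(P.T @ H @ P)

# Hypothetical ill-conditioned SPD Hessian of F(w) = 0.5 * w^T H w.
rng = np.random.default_rng(0)
B = rng.standard_normal((5, 5))
scales = np.diag([1.0, 3.0, 10.0, 30.0, 100.0])
H = scales @ (B @ B.T + 5 * np.eye(5)) @ scales

candidates = [
    ("identity", np.eye(5)),
    ("Jacobi diag(H)", np.diag(np.diag(H))),
    ("exact M = H", H),
]
for name, M in candidates:
    eig = preconditioned_spectrum(H, M)
    print(f"{name:15s} largest eigenvalue = {eig[-1]:.3g}, "
          f"max/min ratio = {eig[-1] / eig[0]:.3g}")
```

Because $\mathbf{P}^\top\mathbf{H}\mathbf{P}$ and $\mathbf{M}^{-1}\mathbf{H}$ share the same eigenvalues, the result does not depend on which factor $\mathbf{P}$ is used, and the choice $\mathbf{M}=\mathbf{H}$ maps every eigenvalue to 1.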

Figures (11)

  • Figure 1: Two-layer MLP with 256 hidden units per layer, trained using fixed learning rate SGD (batch size 128) on Fashion MNIST. Left: Theoretical $\mathcal{O}(1/k)$ decay (dashed) versus empirical training loss. Right: Zoom-in on the asymptotic regime showing the noise floor.
  • Figure 2: Convergence behavior under different deflation-based preconditioners. Left: deflating the largest $s$ eigenvalues ($s\in\{1,5,10,25,50\}$). Middle: deflating the top $20$ eigenvalues to target values $[1.0,2.0,3.0,5.0,10.0]$. Right: deflating the smallest $s$ eigenvalues ($s\in\{1,5,10,25,50\}$).
  • Figure 3: Franke-function regression (mean over $5$ runs). Left: training loss vs. epochs with the switch to Phase II at epoch $500$. Center: training loss vs. wall-clock time. Right: Franke surface.
  • Figure 4: PINN for a Poisson-type PDE (mean over $5$ runs). Left: training loss vs. epochs with Phase I $\rightarrow$ Phase II at epoch $1{,}000$. Center: training loss vs. wall-clock time. Right: source term.
  • Figure 5: Laplacian Green’s-function learning (mean over $5$ runs). Left: loss vs. epochs with Phase I $\rightarrow$ Phase II at epoch $2{,}000$. Center: loss vs. wall-clock time. Right: learned $G(x,y)$ for three source locations and operator checks.
  • ...and 6 more figures

Theorems & Definitions (23)

  • Lemma 1
  • Theorem 2
  • Theorem 3
  • Theorem 4: Convergence to a local minimizer
  • Theorem 5: Diminishing learning rate, local regime
  • Remark 1
  • Theorem 6: Strongly convex objective function, fixed learning rate (Bottou et al., 2018)
  • Theorem 7: Strongly convex objective function, diminishing learning rates (Bottou et al., 2018)
  • Lemma 8
  • proof
  • ...and 13 more