Table of Contents
Fetching ...

Fisher-Geometric Diffusion in Stochastic Gradient Descent: Optimal Rates, Oracle Complexity, and Information-Theoretic Limits

Daniel Zantedeschi, Kumar Muthuraman

TL;DR

A Fisher-geometric theory of stochastic gradient descent (SGD) in which mini-batch noise is an intrinsic, loss-induced matrix -- not an exogenous scalar variance -- is developed, which implies oracle-complexity guarantees for epsilon-stationarity in the Fisher dual norm that depend on an intrinsic effective dimension and a Fisher/Godambe condition number rather than ambient dimension or Euclidean conditioning.

Abstract

We develop a Fisher-geometric theory of stochastic gradient descent (SGD) in which mini-batch noise is an intrinsic, loss-induced matrix -- not an exogenous scalar variance. Under exchangeable sampling, the mini-batch gradient covariance is pinned down (to leading order) by the projected covariance of per-sample gradients: it equals projected Fisher information for well-specified likelihood losses and the projected Godambe (sandwich) matrix for general M-estimation losses. This identification forces a diffusion approximation with Fisher/Godambe-structured volatility (effective temperature tau = eta/b) and yields an Ornstein-Uhlenbeck linearization whose stationary covariance is given in closed form by a Fisher-Lyapunov equation. Building on this geometry, we prove matching minimax upper and lower bounds of order Theta(1/N) for Fisher/Godambe risk under a total oracle budget N; the lower bound holds under a martingale oracle condition (bounded predictable quadratic variation), strictly subsuming i.i.d. and exchangeable sampling. These results imply oracle-complexity guarantees for epsilon-stationarity in the Fisher dual norm that depend on an intrinsic effective dimension and a Fisher/Godambe condition number rather than ambient dimension or Euclidean conditioning. Experiments confirm the Lyapunov predictions and show that scalar temperature matching cannot reproduce directional noise structure.

Fisher-Geometric Diffusion in Stochastic Gradient Descent: Optimal Rates, Oracle Complexity, and Information-Theoretic Limits

TL;DR

A Fisher-geometric theory of stochastic gradient descent (SGD) in which mini-batch noise is an intrinsic, loss-induced matrix -- not an exogenous scalar variance -- is developed, which implies oracle-complexity guarantees for epsilon-stationarity in the Fisher dual norm that depend on an intrinsic effective dimension and a Fisher/Godambe condition number rather than ambient dimension or Euclidean conditioning.

Abstract

We develop a Fisher-geometric theory of stochastic gradient descent (SGD) in which mini-batch noise is an intrinsic, loss-induced matrix -- not an exogenous scalar variance. Under exchangeable sampling, the mini-batch gradient covariance is pinned down (to leading order) by the projected covariance of per-sample gradients: it equals projected Fisher information for well-specified likelihood losses and the projected Godambe (sandwich) matrix for general M-estimation losses. This identification forces a diffusion approximation with Fisher/Godambe-structured volatility (effective temperature tau = eta/b) and yields an Ornstein-Uhlenbeck linearization whose stationary covariance is given in closed form by a Fisher-Lyapunov equation. Building on this geometry, we prove matching minimax upper and lower bounds of order Theta(1/N) for Fisher/Godambe risk under a total oracle budget N; the lower bound holds under a martingale oracle condition (bounded predictable quadratic variation), strictly subsuming i.i.d. and exchangeable sampling. These results imply oracle-complexity guarantees for epsilon-stationarity in the Fisher dual norm that depend on an intrinsic effective dimension and a Fisher/Godambe condition number rather than ambient dimension or Euclidean conditioning. Experiments confirm the Lyapunov predictions and show that scalar temperature matching cannot reproduce directional noise structure.
Paper Structure (104 sections, 20 theorems, 83 equations, 10 figures, 5 tables)

This paper contains 104 sections, 20 theorems, 83 equations, 10 figures, 5 tables.

Key Result

Lemma 3.2

Assume $\int p_\theta(z)\,dz=1$ and differentiation under the integral is valid. Then (i) $\mathbb{E}_\theta[s_\theta(Z)]=0$ and (ii)

Figures (10)

  • Figure 1: Isotropic noise (left) vs. Fisher-aligned noise (right) at the same total variance. Batch size rescales temperature but not shape.
  • Figure 2: OU benchmark: stationary variances vs. batch size $b$ (log scales). Solid: simulation; dashed: Lyapunov prediction from \ref{['eq:lyap_batch_num']}.
  • Figure 3: Angle sweep: stationary variances vs. misalignment angle. Bands show replicate variability; dashed curves are the discrete Lyapunov prediction from \ref{['eq:disc_lyap_num']}.
  • Figure 4: Decaying step size, $d=10$. Left: Fisher-metric risk vs. $N$ (log--log); slopes $\approx -1$ confirm $1/N$ rate. Right: Scaled risk $N\cdot\widehat{\mathrm{Risk}}$ stabilizes; inset shows $C=\mathrm{Tr}(G^\star H^{-1})$.
  • Figure 5: Stationary covariance ellipses at three rotation angles: the Fisher/Godambe equilibrium tilts with $\varphi$; the isotropic trace-matched surrogate remains axis-aligned.
  • ...and 5 more figures

Theorems & Definitions (49)

  • Definition 3.1: Fisher information
  • Lemma 3.2: Score identities lehmann1998theory
  • Definition 3.3: Noise covariance, sensitivity, and Godambe information
  • Remark 3.4: Likelihood as a special case
  • Definition 3.5: Tangent-space projection and projected geometry
  • Remark 3.6: Terminological convention: "Fisher/Godambe"
  • Remark 3.7: What enters later sections
  • Remark 3.11: What these assumptions buy us
  • Lemma 4.1: Exact finite-population covariance of the mini-batch mean
  • Remark 4.2: Equivalent correction-factor forms
  • ...and 39 more