From SGD to Spectra: A Theory of Neural Network Weight Dynamics

Brian Richard Olsen, Sam Fatehmanesh, Frank Xiao, Adarsh Kumarappan, Anirudh Gajula

TL;DR

This work develops a matrix-valued Itô SDE framework that links the microscopic stochastic dynamics of SGD to the macroscopic evolution of neural-network weight spectra. It shows that squared singular values follow Dyson Brownian motion with $\beta=1$, and in the non-negligible gradient regime the stationary spectrum obeys a gamma-type law, explaining the empirically observed bulk+tail spectral structure. The authors validate the theory with controlled experiments on GPT-2, ViT, and MLP architectures, and provide a forecasting algorithm that predicts singular-value trajectories from minimal gradient information. The findings offer a rigorous foundation for understanding why deep networks train effectively and suggest spectral-aware initialization, adaptive optimization, and pruning strategies that leverage the learned spectral structure. While the isotropic-noise assumption underpins the theory, the Appendix outlines extensions to anisotropic SGD fluctuations, highlighting future directions toward closer alignment with real training dynamics.

Abstract

Deep neural networks have revolutionized machine learning, yet their training dynamics remain theoretically unclear. We develop a continuous-time, matrix-valued stochastic differential equation (SDE) framework that rigorously connects the microscopic dynamics of SGD to the macroscopic evolution of singular-value spectra in weight matrices. We derive exact SDEs showing that squared singular values follow Dyson Brownian motion with eigenvalue repulsion, and characterize stationary distributions as gamma-type densities with power-law tails, providing the first theoretical explanation for the empirically observed 'bulk+tail' spectral structure in trained networks. Through controlled experiments on transformer and MLP architectures, we validate our theoretical predictions and demonstrate quantitative agreement between SDE-based forecasts and observed spectral evolution, providing a rigorous foundation for understanding why deep learning works.
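The noise-driven part of this picture can be explored numerically. The following is a minimal sketch, not the authors' code: it assumes the weight matrix receives only isotropic Gaussian noise, $dW = \sqrt{2D}\,dB$, with no gradient drift, and integrates it with an Euler-Maruyama step while recording the singular values. The function names, the Gaussian initialization, and all default parameters are illustrative assumptions.

```python
import numpy as np


def simulate_singular_values(m=8, n=5, D=0.5, dt=5e-3, steps=200, seed=0):
    """Euler-Maruyama integration of dW = sqrt(2 D) dB (isotropic noise only).

    Returns an array of shape (steps + 1, min(m, n)) holding the singular
    values of W after every step, sorted in descending order.
    """
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((m, n))  # assumed initialization, not from the paper
    history = np.empty((steps + 1, min(m, n)))
    history[0] = np.linalg.svd(W, compute_uv=False)
    scale = np.sqrt(2.0 * D * dt)  # per-entry standard deviation of one noise increment
    for step in range(1, steps + 1):
        W = W + scale * rng.standard_normal((m, n))
        history[step] = np.linalg.svd(W, compute_uv=False)
    return history


def mean_energy_growth(num_seeds=40, **kwargs):
    """Average increase of sum_k sigma_k^2 over independent runs."""
    growth = []
    for seed in range(num_seeds):
        h = simulate_singular_values(seed=seed, **kwargs)
        growth.append(np.sum(h[-1] ** 2) - np.sum(h[0] ** 2))
    return float(np.mean(growth))
```

Under these assumptions each entry of $W$ gains variance $2DT$ over a time horizon $T$, so the expected growth of $\sum_k \sigma_k^2 = \|W\|_F^2$ is $2DmnT$. With the defaults above ($T = 1$) that is 40, and `mean_energy_growth()` should land near this value up to Monte Carlo error. Plotting the columns of `history` against the step index shows the individual trajectories and how they keep apart from one another.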

Paper Structure

This paper contains 36 sections, 19 theorems, 65 equations, 5 figures, 1 algorithm.

Key Result

Theorem 3.1

Let $W \in \mathbb{R}^{m \times n}$ evolve via stochastic gradient descent with noise. Then the singular values $\sigma_k(W)$ follow an SDE written in terms of the corresponding left and right singular vectors $u_k, v_k$ and the effective diffusion strength $D$.
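The displayed equation of Theorem 3.1 is not reproduced in this summary. As a hedged sketch of what such an SDE looks like, the block below gives the standard Itô calculation under assumptions that are not taken from the source: $m \ge n$, distinct singular values, and weight dynamics $dW = -\nabla_W L\,dt + \sqrt{2D}\,dB_t$, where $B_t$ has independent standard Brownian entries. The symbols $L$, $B_t$, and $\beta_k$ and the normalization of $D$ are assumptions; the paper's exact constants and learning-rate factors may differ.

```latex
% Sketch under the stated assumptions; not the paper's verbatim Theorem 3.1.
\begin{equation*}
d\sigma_k
  = \Bigl[\, -\,u_k^{\top} \nabla_W L \, v_k
      + D \Bigl( \frac{m-n}{\sigma_k}
      + \sum_{j \neq k} \frac{2\sigma_k}{\sigma_k^{2} - \sigma_j^{2}} \Bigr) \Bigr] dt
  + \sqrt{2D}\, d\beta_k ,
\end{equation*}
% The beta_k are independent standard Brownian motions.
% The sum over j != k is the repulsion term: it diverges as sigma_j -> sigma_k.
```

In this form the first drift term is the gradient projected onto the $k$-th singular direction, and the remaining drift is the Itô correction that pushes singular values away from zero and from each other.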

Figures (5)

  • Figure 1: Singular‐value histograms at batches 0, 200, and 400, overlaid with the Marčenko–Pastur (MP) bulk law (red dashed) and the Tracy–Widom (TW) edge curve (green).
  • Figure 2: Predicted singular values (dashed) versus true.
  • Figure 3: Predicted heavy tails via Theorem 3.2.
  • Figure 4: Mean variance for Gaussian distributions.
  • Figure 5: Spread of singular values (max–median) versus learning rate for different vision transformer weight matrices, with red dashed least-squares trends and slopes indicating sensitivity.

Theorems & Definitions (33)

  • Theorem 3.1: Stochastic Dynamics of Singular Values
  • Theorem 3.2: Stationary Distribution of Singular Values
  • proof
  • proof
  • Theorem 6.1: Error Signal Recursion
  • proof
  • Corollary 6.2: Gradient Formulas
  • proof
  • Theorem 6.3: Continuum PDE for Weight Evolution
  • proof
  • ...and 23 more