From SGD to Spectra: A Theory of Neural Network Weight Dynamics
Brian Richard Olsen, Sam Fatehmanesh, Frank Xiao, Adarsh Kumarappan, Anirudh Gajula
TL;DR
This work develops a matrix-valued Itô SDE framework that links the microscopic stochastic dynamics of SGD to the macroscopic evolution of neural-network weight spectra. It shows that squared singular values follow Dyson Brownian motion with $\beta=1$ and that, in the non-negligible gradient regime, the stationary spectrum obeys a gamma-type law, which explains the empirically observed bulk+tail spectral structure. The authors validate the theory with controlled experiments on GPT-2, ViT, and MLP architectures, and provide a forecasting algorithm that predicts singular-value trajectories from minimal gradient information. The findings offer a rigorous foundation for understanding why deep networks train effectively and suggest spectral-aware initialization, adaptive optimization, and pruning strategies that exploit the learned spectral structure. The theory rests on an isotropic-noise assumption; the appendix outlines extensions to anisotropic SGD fluctuations, pointing to future work on closer alignment with real training dynamics.
Abstract
Deep neural networks have revolutionized machine learning, yet their training dynamics remain theoretically unclear. We develop a continuous-time, matrix-valued stochastic differential equation (SDE) framework that rigorously connects the microscopic dynamics of SGD to the macroscopic evolution of singular-value spectra in weight matrices. We derive exact SDEs showing that squared singular values follow Dyson Brownian motion with eigenvalue repulsion, and characterize stationary distributions as gamma-type densities with power-law tails, providing the first theoretical explanation for the empirically observed 'bulk+tail' spectral structure in trained networks. Through controlled experiments on transformer and MLP architectures, we validate our theoretical predictions and demonstrate quantitative agreement between SDE-based forecasts and observed spectral evolution, offering a rigorous foundation for understanding why deep learning works.
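As an illustration of the kind of object the framework describes, the sketch below simulates a weight matrix driven only by isotropic Gaussian noise and records its squared singular values over time. This is not the authors' forecasting algorithm: the zero-gradient (pure-noise) setting, the Euler-Maruyama discretization, the function name `simulate_squared_singular_values`, and all parameter values (`m`, `n`, `sigma`, `dt`, `steps`) are assumptions chosen for illustration.

```python
import numpy as np

def simulate_squared_singular_values(m=20, n=30, sigma=0.1, dt=1e-2,
                                     steps=200, seed=0):
    """Euler-Maruyama simulation of dW = sigma * dB for an m x n matrix W.

    Hypothetical toy setting: the gradient drift is set to zero, so only
    the isotropic noise term of an SGD-like SDE is present.
    Returns an array of shape (steps + 1, min(m, n)) holding the squared
    singular values of W at each time step, in descending order.
    """
    rng = np.random.default_rng(seed)
    # Random initialization with variance 1/n per entry (an assumption).
    W = rng.standard_normal((m, n)) / np.sqrt(n)
    history = np.empty((steps + 1, min(m, n)))
    history[0] = np.linalg.svd(W, compute_uv=False) ** 2
    for k in range(1, steps + 1):
        # Isotropic Brownian increment: i.i.d. N(0, sigma^2 * dt) entries.
        W = W + sigma * np.sqrt(dt) * rng.standard_normal((m, n))
        history[k] = np.linalg.svd(W, compute_uv=False) ** 2
    return history

history = simulate_squared_singular_values()
print(history.shape)
print(history[0][:3], history[-1][:3])
```

Plotting the columns of `history` against time gives trajectories that drift apart over time; adding a gradient-dependent drift term to the update would be the natural next step toward the regime in which the gamma-type stationary law is stated.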
