Random Matrix Theory for Stochastic Gradient Descent
Chanju Park, Matteo Favoni, Biagio Lucini, Gert Aarts
TL;DR
The paper shows how stochastic gradient descent dynamics can be understood through random matrix theory by mapping stochastic weight-matrix updates to Dyson Brownian motion of the eigenvalues. It derives the linear scaling rule between the learning rate $\alpha$ and the batch size $|B|$, showing that fluctuations scale with $\sqrt{\alpha/|B|}$ and that the stationary state can be described by a Coulomb-gas formulation. The authors validate the approach on a Gaussian RBM, where the spectrum follows the Wigner semicircle and level spacings obey the Wigner surmise, and extend the analysis to a linear neural network with one hidden layer, which yields a multi-species Coulomb gas and a generalized Wigner semicircle. The framework provides a principled explanation for SGD fluctuations and offers guidance for hyperparameter tuning in practice, with potential extensions to more complex architectures.
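As a rough illustration of the Wigner semicircle mentioned above, the sketch below samples a generic real symmetric Gaussian matrix and compares its eigenvalue histogram with the semicircle density. This is a stand-in under stated assumptions, not the paper's trained RBM weight matrix: the function names `sample_goe_eigenvalues` and `semicircle_density` are hypothetical, and the normalization (off-diagonal variance $1/N$, giving support $[-2, 2]$) is chosen here for convenience.

```python
import numpy as np


def sample_goe_eigenvalues(n, seed=0):
    """Eigenvalues of an n x n real symmetric Gaussian matrix.

    Assumed normalization: off-diagonal variance 1/n, diagonal variance 2/n,
    so the large-n spectrum approaches a semicircle on [-2, 2].
    """
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((n, n))
    h = (a + a.T) / np.sqrt(2.0 * n)  # symmetrize and rescale
    return np.linalg.eigvalsh(h)      # real eigenvalues, ascending order


def semicircle_density(x, radius=2.0):
    """Wigner semicircle density with the given radius (zero outside it)."""
    x = np.asarray(x, dtype=float)
    inside = np.clip(radius**2 - x**2, 0.0, None)
    return 2.0 * np.sqrt(inside) / (np.pi * radius**2)


if __name__ == "__main__":
    eigs = sample_goe_eigenvalues(500)
    hist, edges = np.histogram(eigs, bins=20, range=(-2.2, 2.2), density=True)
    centers = 0.5 * (edges[:-1] + edges[1:])
    # Print empirical and predicted densities side by side for each bin.
    for c, h_emp in zip(centers, hist):
        print(f"{c:+.2f}  empirical={h_emp:.3f}  semicircle={semicircle_density(c):.3f}")
```

For a single finite matrix the agreement is only approximate; it tightens as the matrix size grows or as the histogram is averaged over several seeds.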
Abstract
Investigating the dynamics of learning in machine learning algorithms is of paramount importance for understanding how and why an approach may be successful. The tools of physics and statistics provide a robust setting for such investigations. Here we apply concepts from random matrix theory to describe stochastic weight matrix dynamics, using the framework of Dyson Brownian motion. We derive the linear scaling rule between the learning rate (step size) and the batch size, and identify universal and non-universal aspects of weight matrix dynamics. We test our findings in the (near-)solvable case of the Gaussian Restricted Boltzmann Machine and in a linear one-hidden-layer neural network.
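The linear scaling rule can be illustrated with a toy model, sketched here under strong simplifying assumptions: a single scalar weight trained by minibatch SGD on the quadratic loss $\frac{1}{2}(w - d)^2$ with unit-variance Gaussian data $d$. This is not the paper's matrix-valued derivation; the function names and default parameters are hypothetical. For this toy model the stationary variance can be worked out directly as $\alpha / (|B|(2 - \alpha))$, so for small $\alpha$ the fluctuations depend only on the ratio $\alpha/|B|$.

```python
import numpy as np


def stationary_std(alpha, batch_size, steps=100_000, burn_in=5_000, seed=0):
    """Empirical stationary std of a scalar weight under minibatch SGD.

    Toy setup (an assumption, not the paper's model): loss 0.5 * (w - d)**2,
    data d ~ N(0, 1), so the minibatch gradient is w - mean(batch).
    """
    rng = np.random.default_rng(seed)
    # Pre-draw all minibatches and reduce each one to its mean.
    batch_means = rng.standard_normal((steps, batch_size)).mean(axis=1)
    w = 0.0
    trace = np.empty(steps)
    for t in range(steps):
        grad = w - batch_means[t]  # stochastic gradient from this minibatch
        w -= alpha * grad          # SGD update with learning rate alpha
        trace[t] = w
    return float(trace[burn_in:].std())


def predicted_std(alpha, batch_size):
    """Exact stationary std for the toy model: sqrt(alpha / (|B| (2 - alpha)))."""
    return float(np.sqrt(alpha / (batch_size * (2.0 - alpha))))


if __name__ == "__main__":
    for alpha, b in [(0.02, 8), (0.04, 16), (0.04, 8)]:
        print(f"alpha={alpha}, |B|={b}: "
              f"measured={stationary_std(alpha, b):.4f}, "
              f"predicted={predicted_std(alpha, b):.4f}")
```

Doubling both the learning rate and the batch size leaves the measured fluctuations nearly unchanged, while doubling the learning rate alone increases them by roughly $\sqrt{2}$.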
