Stochastic weight matrix dynamics during learning and Dyson Brownian motion
Gert Aarts, Biagio Lucini, Chanju Park
TL;DR
By treating weight updates in stochastic gradient descent as a Dyson Brownian motion of the eigenvalues of $X = W^T W$, the paper shows that learning dynamics induce eigenvalue repulsion and converge to a Coulomb-gas stationary distribution that depends only on the ratio $\alpha/|{\cal B}|$ of learning rate to mini-batch size. Universal spectral features arise, notably the Wigner surmise for level spacings and the Wigner semicircle for the spectral density. These predictions are tested in Gaussian RBMs and a teacher-student model, confirming the universal scaling and elucidating non-universal drift effects from the loss landscape. The results imply a fundamental limit on learning accuracy set by stochasticity, while highlighting a potential generalization benefit, and they suggest extending the analysis to more complex architectures and optimizers.
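As a rough illustration of why the ratio $\alpha/|{\cal B}|$ controls the level of stochasticity, the sketch below runs noisy gradient descent on a scalar quadratic loss. This is a toy stand-in, not the paper's matrix setup: the curvature `k`, the noise scale `sigma`, and the function name `stationary_variance` are assumptions introduced here. For this toy update the stationary variance is $\alpha\sigma^2 / \big(|{\cal B}|\,k\,(2-\alpha k)\big)$, which for small $\alpha k$ depends on $\alpha$ and $|{\cal B}|$ only through their ratio.

```python
import numpy as np

def stationary_variance(alpha, batch_size, k=1.0, sigma=1.0,
                        n_chains=4000, n_steps=4000, seed=0):
    """Empirical stationary variance of noisy gradient descent on L(w) = k w^2 / 2.

    Assumption (toy model): the mini-batch gradient is the exact gradient k*w
    plus Gaussian noise whose standard deviation is sigma / sqrt(batch_size).
    """
    rng = np.random.default_rng(seed)
    w = np.zeros(n_chains)  # independent chains, all started at the minimum
    for _ in range(n_steps):
        noise = rng.normal(scale=sigma / np.sqrt(batch_size), size=n_chains)
        w = w - alpha * (k * w + noise)
    return w.var()

# Same ratio alpha / batch_size = 0.01 in both runs.
var_a = stationary_variance(alpha=0.01, batch_size=1)
var_b = stationary_variance(alpha=0.04, batch_size=4, seed=1)
# Ratio four times larger: alpha / batch_size = 0.04.
var_c = stationary_variance(alpha=0.04, batch_size=1, seed=2)

print(f"alpha/B = 0.01: {var_a:.5f} and {var_b:.5f}")
print(f"alpha/B = 0.04: {var_c:.5f}")
```

The first two runs should give nearly equal variances, while the third should be roughly four times larger, up to sampling error and the small $(2-\alpha k)$ correction.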
Abstract
We demonstrate that the update of weight matrices in learning algorithms can be described in the framework of Dyson Brownian motion, thereby inheriting many features of random matrix theory. We relate the level of stochasticity to the ratio of the learning rate and the mini-batch size, providing more robust evidence for a previously conjectured scaling relationship. We discuss universal and non-universal features in the resulting Coulomb gas distribution and identify the Wigner surmise and Wigner semicircle explicitly in a teacher-student model and in the (near-)solvable case of the Gaussian restricted Boltzmann machine.
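For readers unfamiliar with the Wigner surmise, the following minimal sketch shows the level-spacing statistics it describes. It samples $2\times 2$ real symmetric Gaussian (GOE) matrices, for which the surmise $P(s) = \frac{\pi}{2}\, s\, e^{-\pi s^2/4}$ is exact once the mean spacing is normalised to one. This is a generic random-matrix illustration rather than the paper's weight matrices, and the helper names are assumptions made here.

```python
import numpy as np

def wigner_surmise_cdf(s):
    """CDF of the Wigner surmise P(s) = (pi/2) s exp(-pi s^2 / 4)."""
    return 1.0 - np.exp(-np.pi * s**2 / 4.0)

def sample_goe_spacings(n_samples, seed=0):
    """Normalised eigenvalue spacings of 2x2 real symmetric Gaussian matrices."""
    rng = np.random.default_rng(seed)
    a = rng.normal(size=(n_samples, 2, 2))
    h = (a + np.swapaxes(a, 1, 2)) / 2.0   # symmetrise each matrix
    eig = np.linalg.eigvalsh(h)            # ascending eigenvalues, shape (n_samples, 2)
    s = eig[:, 1] - eig[:, 0]
    return s / s.mean()                    # rescale to unit mean spacing

spacings = sample_goe_spacings(20000)
empirical_cdf_at_1 = np.mean(spacings < 1.0)
small_spacing_fraction = np.mean(spacings < 0.1)

print("empirical P(s < 1):", empirical_cdf_at_1)
print("surmise   P(s < 1):", wigner_surmise_cdf(1.0))
print("fraction with s < 0.1:", small_spacing_fraction)
```

The very small fraction of spacings near zero is the eigenvalue repulsion: $P(s)$ vanishes linearly as $s \to 0$, unlike the exponential spacing law of uncorrelated levels.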
