
Stochastic weight matrix dynamics during learning and Dyson Brownian motion

Gert Aarts, Biagio Lucini, Chanju Park

TL;DR

By treating weight updates in stochastic gradient descent as a Dyson Brownian motion of the eigenvalues of $X = W^T W$, the paper shows learning dynamics induce eigenvalue repulsion and converge to a Coulomb-gas stationary distribution that depends only on the ratio $\alpha/|{\cal B}|$. Universal spectral features arise, notably the Wigner surmise for level spacings and the Wigner semicircle for the spectral density. These predictions are tested in Gaussian RBMs and a teacher-student model, confirming the universal scaling and elucidating non-universal drift effects from the loss landscape. The results imply a fundamental limit on learning accuracy set by stochasticity while highlighting a potential generalization benefit, and they suggest extending to more complex architectures and optimizers.

Abstract

We demonstrate that the update of weight matrices in learning algorithms can be described in the framework of Dyson Brownian motion, thereby inheriting many features of random matrix theory. We relate the level of stochasticity to the ratio of the learning rate and the mini-batch size, providing more robust evidence for a previously conjectured scaling relationship. We discuss universal and non-universal features in the resulting Coulomb gas distribution and identify the Wigner surmise and Wigner semicircle explicitly in a teacher-student model and in the (near-)solvable case of the Gaussian restricted Boltzmann machine.
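The claim that the level of stochasticity is set by the ratio of learning rate to mini-batch size can be illustrated with a toy example. The sketch below is not the paper's matrix model: it assumes a one-dimensional quadratic loss with curvature `k`, per-sample gradient noise of standard deviation `noise_std`, and a hypothetical helper `stationary_variance`. It only shows that, for small $\alpha$, the spread of the weight at late times follows $\alpha/|{\cal B}|$ rather than $\alpha$ or $|{\cal B}|$ separately.

```python
import numpy as np

def stationary_variance(alpha, batch_size, k=1.0, noise_std=1.0,
                        n_chains=20_000, n_steps=1_500, seed=0):
    """Late-time variance of w under noisy SGD on the toy loss L(w) = k w^2 / 2.

    Assumption: each per-sample gradient is k*w plus independent Gaussian noise,
    so the mini-batch average carries noise of std noise_std / sqrt(batch_size).
    """
    rng = np.random.default_rng(seed)
    w = np.zeros(n_chains)  # independent chains, all started at the minimum
    for _ in range(n_steps):
        # Noise of the batch-averaged gradient, drawn directly from its distribution.
        grad_noise = rng.normal(0.0, noise_std / np.sqrt(batch_size), n_chains)
        w -= alpha * (k * w + grad_noise)
    return w.var()

v_a = stationary_variance(alpha=0.01, batch_size=10, seed=0)  # alpha/|B| = 1e-3
v_b = stationary_variance(alpha=0.02, batch_size=20, seed=1)  # alpha/|B| = 1e-3
v_c = stationary_variance(alpha=0.02, batch_size=10, seed=2)  # alpha/|B| = 2e-3

print(f"v_b / v_a = {v_b / v_a:.3f}")
print(f"v_c / v_a = {v_c / v_a:.3f}")
```

In this toy setting the first two runs share the same $\alpha/|{\cal B}|$ and should give nearly equal variances, while the third doubles the ratio and should give roughly twice the variance. The agreement is only approximate: for this linear update the exact stationary variance carries a correction factor $1/(2-\alpha k)$.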

Paper Structure (14 sections, 98 equations, 8 figures)

This paper contains 14 sections, 98 equations, 8 figures.

Figures (8)

  • Figure 1: Sketch of a restricted Boltzmann machine with $N_v$ ($N_h$) nodes on the visible (hidden) layer, connected by the $N_v\times N_h$ matrix $W$.
  • Figure 2: Target spectrum $\kappa_i$ ($i=1,\ldots, 10$): each mode, except the lowest and the highest ones, is doubly degenerate.
  • Figure 3: Learnt distributions of eigenvalues $x_i =\mu^2-\lambda_i$. The target eigenvalues $\mu^2-\kappa_i$ are shown with dashed vertical lines. All except the lowest and the highest target eigenvalues are doubly degenerate.
  • Figure 4: Wigner surmise for the level spacing $S$ (left) and for the rescaled $s=S/\langle S\rangle$ (right) for the four doubly-degenerate modes labelled by $\kappa$. The lines on the left are fits with $\langle S\rangle = \sqrt{\pi}\sigma$ as a free parameter. The rescaled histograms on the right collapse to the universal curve, $P(s)$.
  • Figure 6: Response of the mean level spacing $\langle S\rangle$ (left) and the width parameter of the spectral density $\sqrt{\pi}\sigma$ (middle) to variation of the learning rate $\alpha$ and the batch size $|{\cal B}|$, presented in the combination $\sqrt{(\alpha/|{\cal B}|)\kappa_i^2\Omega_i}$, for 4 doubly-degenerate pairs, identified by target eigenvalues $\kappa_i$. Expected linear relation between $\langle S\rangle$ and $\sqrt{\pi}\sigma$ upon independent variation of $\alpha$ and $|{\cal B}|$ (right).
  • ...and 3 more figures
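The relation $\langle S\rangle = \sqrt{\pi}\sigma$ and the collapse onto a universal curve in the Figure 4 caption can be checked numerically in the simplest random-matrix setting. The sketch below assumes the Wigner surmise in the form $P(S) = \frac{S}{2\sigma^2} e^{-S^2/(4\sigma^2)}$, which reproduces that mean, and samples $2\times 2$ real symmetric Gaussian matrices with diagonal variance $\sigma^2$ and off-diagonal variance $\sigma^2/2$. This normalization and the helper names are choices made for the illustration; the code does not reproduce the paper's RBM or teacher-student experiments.

```python
import numpy as np

def sample_spacings(sigma, n_samples, rng):
    """Eigenvalue spacings of 2x2 real symmetric Gaussian matrices.

    Assumed normalization: diagonal entries have variance sigma^2 and the
    off-diagonal entry has variance sigma^2 / 2.
    """
    mats = np.empty((n_samples, 2, 2))
    mats[:, 0, 0] = rng.normal(0.0, sigma, n_samples)
    mats[:, 1, 1] = rng.normal(0.0, sigma, n_samples)
    off_diag = rng.normal(0.0, sigma / np.sqrt(2.0), n_samples)
    mats[:, 0, 1] = off_diag
    mats[:, 1, 0] = off_diag
    eig = np.linalg.eigvalsh(mats)  # ascending eigenvalues, shape (n_samples, 2)
    return eig[:, 1] - eig[:, 0]

def wigner_surmise(s):
    """Universal curve for the rescaled spacing s = S / <S>."""
    return 0.5 * np.pi * s * np.exp(-0.25 * np.pi * s**2)

rng = np.random.default_rng(0)
sigma = 0.3
S = sample_spacings(sigma, 200_000, rng)

mean_ratio = S.mean() / (np.sqrt(np.pi) * sigma)  # compares <S> with sqrt(pi)*sigma

# Histogram of the rescaled spacing, compared with the universal curve.
s = S / S.mean()
hist, edges = np.histogram(s, bins=40, range=(0.0, 4.0), density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
max_dev = np.max(np.abs(hist - wigner_surmise(centers)))

print(f"<S> / (sqrt(pi) sigma) = {mean_ratio:.4f}")
print(f"max |histogram - P(s)| = {max_dev:.4f}")
```

Changing `sigma` rescales the raw spacings but should leave the rescaled histogram unchanged, which is the sense in which $P(s)$ is universal while $\sigma$ carries the non-universal information.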