Random Matrix Theory for Stochastic Gradient Descent

Chanju Park, Matteo Favoni, Biagio Lucini, Gert Aarts

TL;DR

The paper addresses how stochastic gradient descent dynamics can be understood from random-matrix theory by mapping weight updates to Dyson Brownian motion of eigenvalues. It derives a linear scaling rule between the learning rate and batch size, showing that fluctuations scale with $\sqrt{\alpha/|B|}$ and that the stationary state can be described by a Coulomb-gas formulation. The authors validate the approach on a Gaussian RBM where the spectrum follows the Wigner semicircle and level spacings obey the Wigner surmise, and extend the analysis to a linear neural network with one hidden layer, revealing a multi-species Coulomb gas and a generalized Wigner semicircle. This framework provides a principled explanation for SGD fluctuations and offers guidance for hyperparameter tuning in practice, with potential extensions to more complex architectures.

Abstract

Investigating the dynamics of learning in machine learning algorithms is of paramount importance for understanding how and why an approach may be successful. The tools of physics and statistics provide a robust setting for such investigations. Here we apply concepts from random matrix theory to describe stochastic weight matrix dynamics, using the framework of Dyson Brownian motion. We derive the linear scaling rule between the learning rate (step size) and the batch size, and identify universal and non-universal aspects of weight matrix dynamics. We test our findings in the (near-)solvable case of the Gaussian Restricted Boltzmann Machine and in a linear one-hidden-layer neural network.
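As a rough illustration of the linear scaling rule, the sketch below runs mini-batch SGD on a one-dimensional quadratic loss. This is a toy scalar model chosen for this note, not the weight-matrix setup of the paper; the function name `stationary_std`, the loss, and all parameter values are assumptions. For this toy problem the stationary variance can be worked out by hand as $\alpha\sigma^2/(|B|(2-\alpha))$, so for small $\alpha$ the spread of the weight depends only on the ratio $\alpha/|B|$.

```python
import numpy as np


def stationary_std(alpha, batch_size, steps=200_000, burn_in=5_000,
                   sigma=1.0, seed=0):
    """Std of w in the stationary state of mini-batch SGD.

    Toy model (an assumption, not the paper's setup): per-sample loss
    0.5 * (w - x)**2 with data x ~ N(0, sigma**2).
    """
    rng = np.random.default_rng(seed)
    # Each row is one mini-batch; only the batch mean enters the gradient.
    batch_means = rng.normal(0.0, sigma, size=(steps, batch_size)).mean(axis=1)

    w = 0.0
    trace = np.empty(steps)
    for t in range(steps):
        grad = w - batch_means[t]  # mini-batch gradient of the quadratic loss
        w -= alpha * grad          # plain SGD step with learning rate alpha
        trace[t] = w

    # Discard the initial transient before measuring the fluctuations.
    return trace[burn_in:].std()
```

Calling `stationary_std(0.01, 8)` and `stationary_std(0.02, 16)` should give nearly the same value, since both have the same $\alpha/|B|$, whereas `stationary_std(0.02, 8)` should be larger by roughly a factor of $\sqrt{2}$.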

Paper Structure

This paper contains 7 sections, 32 equations, and 4 figures.

Figures (4)

  • Figure 1: General structure of a Restricted Boltzmann Machine, with $N_v$ ($N_h$) visible (hidden) nodes.
  • Figure 2: (Left) Target eigenvalues (dashed lines) and model eigenvalues (histograms) after training. The middle $8$ target eigenvalues are doubly degenerate due to periodic boundary conditions. (Right) Close-up of one of the peaks: the learnt eigenvalue distribution of the RBM follows the Wigner semi-circle (solid line).
  • Figure 4: Training dynamics of the square of the singular values of a $2 \times 2$ student matrix, given a teacher matrix with doubly degenerate eigenvalues, using $Z$ as in Eq. \ref{eq:mat_1} (left) and Eq. \ref{eq:mat_2} (right). The presence of $Z$ affects the rate of convergence. Shown are an ensemble of 20 networks (with opaque lines), the evolution averaged over an ensemble of 500 networks (with solid blue and orange lines), and fits to Eq. \ref{eq:fit}, starting from epoch $t=2$ (with dashed lines), agreeing with the averaged evolution.
  • Figure 5: Histogram of the spectral density $\rho(x)$ in the presence of a hidden layer, with $Z$ as in Eq. \ref{eq:mat_1} (left) and Eq. \ref{eq:mat_2} (right). Also shown are fits to the standard Wigner semi-circle, Eq. \ref{eq:spectral_density} (dashed line), and the generalised Wigner semi-circle, Eq. \ref{eq:gen_wigsc}, for two species (solid line). The generalised Wigner semi-circle better captures the histogram's peak and wider tails, as seen in particular on the right.
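
The standard Wigner semi-circle that the captions above compare against can be sketched numerically as follows. This uses a generic symmetric Gaussian random matrix rather than the trained RBM or the hidden-layer network from the paper, and the function names `goe_eigenvalues` and `wigner_semicircle`, the matrix size, and the normalisation (chosen so that the spectrum fills $[-2, 2]$) are assumptions made for illustration. The generalised two-species form is not reproduced here, since its formula is not given in this summary.

```python
import numpy as np


def goe_eigenvalues(n, seed=0):
    """Eigenvalues of an n x n symmetric Gaussian random matrix.

    Scaled so that off-diagonal entries have variance 1/n, which puts
    the limiting spectrum on the interval [-2, 2].
    """
    rng = np.random.default_rng(seed)
    a = rng.normal(size=(n, n))
    m = (a + a.T) / np.sqrt(2.0 * n)
    return np.linalg.eigvalsh(m)


def wigner_semicircle(x, radius=2.0):
    """Semi-circle density 2 * sqrt(R^2 - x^2) / (pi * R^2), zero outside |x| <= R."""
    x = np.asarray(x, dtype=float)
    inside = np.clip(radius**2 - x**2, 0.0, None)
    return 2.0 * np.sqrt(inside) / (np.pi * radius**2)


def max_histogram_deviation(eigs, bins=40, span=2.2):
    """Largest gap between the eigenvalue histogram and the semi-circle at bin centres."""
    hist, edges = np.histogram(eigs, bins=bins, range=(-span, span), density=True)
    centres = 0.5 * (edges[:-1] + edges[1:])
    return float(np.max(np.abs(hist - wigner_semicircle(centres))))
```

For example, `max_histogram_deviation(goe_eigenvalues(1000))` should be small compared with the peak height $1/\pi$ of the density, and it is expected to shrink as the matrix size grows.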