Dyson Brownian motion and random matrix dynamics of weight matrices during learning

Gert Aarts, Ouraman Hajizadeh, Biagio Lucini, Chanju Park

Abstract

During training, weight matrices in machine learning architectures are updated using stochastic gradient descent or variations thereof. In this contribution we employ concepts of random matrix theory to analyse the resulting stochastic matrix dynamics. We first demonstrate that the dynamics can generically be described using Dyson Brownian motion, leading to, for example, eigenvalue repulsion. The level of stochasticity is shown to depend on the ratio of the learning rate and the mini-batch size, explaining the empirically observed linear scaling rule. We verify this linear scaling in the restricted Boltzmann machine. Subsequently we study weight matrix dynamics in transformers (a nano-GPT), following the evolution of the eigenvalue distribution from a Marchenko-Pastur distribution at initialisation to a combination of a Marchenko-Pastur component and additional structure at the end of learning.
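The dependence of the stochasticity on the ratio of learning rate and mini-batch size can be illustrated with a toy model. The sketch below is not the setup used in the paper: it assumes a single weight $w$, a quadratic loss $L(w)=k w^2/2$, and per-sample gradients with additive Gaussian noise of variance $\sigma^2$. All names (`stationary_variance`, `k`, `sigma`, `n_chains`) are illustrative. For this toy model the late-time variance of $w$ is $\alpha\sigma^2/[|{\cal B}|\,k\,(2-\alpha k)]$, which for small $\alpha k$ depends on $\alpha$ and $|{\cal B}|$ only through $\alpha/|{\cal B}|$.

```python
import numpy as np


def stationary_variance(alpha, batch_size, k=1.0, sigma=1.0,
                        n_chains=2000, n_steps=1000, seed=0):
    """Estimate the late-time variance of w under SGD on L(w) = k w^2 / 2.

    Toy assumption: each per-sample gradient is k*w + sigma*xi with
    xi ~ N(0, 1); the mini-batch gradient averages `batch_size` samples.
    """
    rng = np.random.default_rng(seed)
    w = np.zeros(n_chains)  # independent replicas, all started at the minimum
    for _ in range(n_steps):
        xi = rng.standard_normal((n_chains, batch_size))
        grad = k * w + sigma * xi.mean(axis=1)  # mini-batch gradient estimate
        w = w - alpha * grad                    # plain SGD update
    return w.var()


# Same ratio alpha/|B|: the fluctuations should (approximately) agree.
v_ref = stationary_variance(alpha=0.01, batch_size=8, seed=0)
v_same = stationary_variance(alpha=0.02, batch_size=16, seed=1)
# Doubling alpha/|B| should roughly double the variance.
v_double = stationary_variance(alpha=0.02, batch_size=8, seed=2)

# Closed-form value for the reference setting (k = sigma = 1).
v_exact = 0.01 / (8 * (2.0 - 0.01))

print("same ratio   :", v_same / v_ref)
print("double ratio :", v_double / v_ref)
print("vs. formula  :", v_ref / v_exact)
```

The replicas are only a device to estimate the variance from a single time slice; averaging over late times of one long run would work equally well in this toy setting.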

Paper Structure

This paper contains 7 sections, 9 equations, 4 figures.

Figures (4)

  • Figure 1: Gaussian RBM: Ratio of the RBM eigenvalues $\lambda_i=\mu^2-x_i$ and the target eigenvalues $\kappa_i$ as a function of $\alpha/|{\cal B}|$, where $\alpha$ and $|{\cal B}|$ are independently varied, demonstrating eigenvalue repulsion for non-vanishing stochasticity (left). Response of the mean level spacing $\langle S\rangle$ to variation of $\alpha$ and $|{\cal B}|$, presented in the combination $\sqrt{\alpha/|{\cal B}|}$ times a non-universal function $\sqrt{\kappa_i^2 \Omega_i}=\sqrt{\mu^2-\kappa_i}$ (right). Figures from Ref. Aarts:2024wxi.
  • Figure 2: Gaussian RBM: Evolution of eigenvalues of $X = \sigma_h^2 W^TW$ from the Marchenko-Pastur distribution at initialisation (left) to the learned distribution around the target eigenvalues, indicated with the vertical lines, at the end of training (right).
  • Figure 3: Transformer: Evolution during training of the eigenvalue distribution of $X=K^TK$, where $K$ is the Key matrix of the transformer's first layer, at initialisation (left), iteration 1000 (middle) and iteration 5000 (right). Above: distribution $P(s)$ of the normalised eigenvalue spacing $s_i=x_{i+1}-x_i$ after spectral unfolding, compared to the Wigner surmise. Below: spectral density $\rho(x)$, compared to fits to the Marchenko-Pastur distribution with fit parameters $\sigma^2$ and area $A$.
  • Figure 4: Transformer: Evolution of fit parameters area $A$ (left) and $\sigma^2$ (right) of the Marchenko-Pastur distribution fit to the spectral density $\rho(x)$ of $X=K^TK$ of the first head for all layers. Statistical uncertainties are estimated with a bootstrap analysis over at least 50 repeated training runs.
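
The Marchenko-Pastur starting point referred to in Figures 2 to 4 can be reproduced with a short numerical check. The sketch below is illustrative only and makes its own assumptions: a Gaussian $N\times M$ matrix $W$ with entry variance $\sigma^2$, the normalisation $X = W^TW/N$ (the captions use $X=\sigma_h^2 W^TW$ and $X=K^TK$, whose normalisation may differ), and the hypothetical helper name `mp_density`. With $q=M/N\le 1$, the density is $\rho(x)=\sqrt{(x_+-x)(x-x_-)}/(2\pi\sigma^2 q x)$ on $x_\pm=\sigma^2(1\pm\sqrt{q})^2$, with mean $\sigma^2$ and variance $\sigma^4 q$.

```python
import numpy as np


def mp_density(x, sigma2, q):
    """Marchenko-Pastur density for X = W^T W / N with aspect ratio q = M/N <= 1.

    Returns the density on the grid x together with the support edges.
    """
    x = np.asarray(x, dtype=float)
    x_minus = sigma2 * (1.0 - np.sqrt(q)) ** 2
    x_plus = sigma2 * (1.0 + np.sqrt(q)) ** 2
    rho = np.zeros_like(x)
    inside = (x > x_minus) & (x < x_plus)  # density vanishes outside the support
    xi = x[inside]
    rho[inside] = np.sqrt((x_plus - xi) * (xi - x_minus)) / (
        2.0 * np.pi * sigma2 * q * xi)
    return rho, x_minus, x_plus


# Random Gaussian matrix, standing in for a weight matrix at initialisation.
rng = np.random.default_rng(0)
N, M, sigma2 = 2000, 500, 1.0
W = rng.normal(0.0, np.sqrt(sigma2), size=(N, M))
eigs = np.linalg.eigvalsh(W.T @ W / N)  # eigenvalues of the M x M matrix X

# Analytic density on a grid, and its area via the trapezoidal rule.
grid = np.linspace(0.0, 3.0, 3001)
rho, x_minus, x_plus = mp_density(grid, sigma2, M / N)
area = np.sum(0.5 * (rho[1:] + rho[:-1]) * np.diff(grid))

print("support edges  :", x_minus, x_plus)
print("eigenvalue span:", eigs.min(), eigs.max())
print("mean, variance :", eigs.mean(), eigs.var())
print("area under rho :", area)
```

In a fit of the kind described in Figures 3 and 4, $\sigma^2$ and an overall area $A$ multiplying $\rho(x)$ would be left free; here the area is fixed to one because all eigenvalues belong to the Marchenko-Pastur bulk.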