Phase diagram and eigenvalue dynamics of stochastic gradient descent in multilayer neural networks

Chanju Park, Biagio Lucini, Gert Aarts

TL;DR

This work reframes stochastic gradient descent (SGD) training of multilayer neural networks as a non-equilibrium phase problem in disordered systems. By mapping weight matrices and learned features to a spin-glass-like Hamiltonian and introducing an effective temperature T = ε/|B| (the learning rate ε divided by the batch size |B|), the authors identify three dynamical regimes (ferromagnetic/ordered, jammed/disordered, paramagnetic) and validate them with a teacher–student numerical study, deriving phase boundaries from symmetry considerations and spectral-level-spacing dynamics. The approach yields practical hyperparameter guidance: learning is efficient when training stays in the ordered phase and avoids high-temperature fluctuations, and the spectral flow of the weight matrices tracks learning performance. Overall, the paper provides a principled framework to understand and predict SGD behavior in neural networks using tools from random-matrix theory and non-equilibrium statistical physics, with potential implications for architecture and optimization design.

Abstract

Hyperparameter tuning is one of the essential steps to guarantee the convergence of machine learning models. We argue that intuition about the optimal choice of hyperparameters for stochastic gradient descent can be obtained by studying a neural network's phase diagram, in which each phase is characterised by distinctive dynamics of the singular values of weight matrices. Taking inspiration from disordered systems, we start from the observation that the loss landscape of a multilayer neural network with mean squared error can be interpreted as a disordered system in feature space, where the learnt features are mapped to soft spin degrees of freedom, the initial variance of the weight matrices is interpreted as the strength of the disorder, and temperature is given by the ratio of the learning rate and the batch size. As the model is trained, three phases can be identified, in which the dynamics of weight matrices is qualitatively different. Employing a Langevin equation for stochastic gradient descent, previously derived using Dyson Brownian motion, we demonstrate that the three dynamical regimes can be classified effectively, providing practical guidance for the choice of hyperparameters of the optimiser.
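
As a rough illustration of the two control parameters named in the abstract, the sketch below computes the temperature as the ratio of learning rate to batch size and extracts the eigenvalues of $X = WW^T$ (the squared singular values of a weight matrix $W$). The function names, the layer shape, and the plain Gaussian initialisation with standard deviation `sigma_w` are assumptions made for this example; the paper's actual normalisation of the initial weights may differ.

```python
import numpy as np


def effective_temperature(learning_rate: float, batch_size: int) -> float:
    """Temperature T = epsilon / |B| (learning rate over batch size)."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return learning_rate / batch_size


def gram_eigenvalues(W: np.ndarray) -> np.ndarray:
    """Eigenvalues of X = W W^T in ascending order.

    These equal the squared singular values of W (padded with zeros
    if W has more rows than columns).
    """
    return np.linalg.eigvalsh(W @ W.T)


rng = np.random.default_rng(0)

# Hypothetical values: 1/sigma_W = 8 and a 4 x 6 layer, chosen only for illustration.
sigma_w = 1.0 / 8.0
W = sigma_w * rng.standard_normal((4, 6))

# Two different (learning rate, batch size) pairs with the same ratio
# correspond to the same temperature in this picture.
T_a = effective_temperature(0.5, 16)
T_b = effective_temperature(1.0, 32)

eigs = gram_eigenvalues(W)
svals = np.linalg.svd(W, compute_uv=False)

print("T_a =", T_a, "T_b =", T_b)
print("eigenvalues of W W^T:", eigs)
```

In this framing, doubling the batch size while doubling the learning rate leaves the temperature unchanged, so a hyperparameter scan can be organised along the two axes $T$ and $1/\sigma_W$ rather than over learning rate and batch size separately.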

Paper Structure

This paper contains 17 sections, 65 equations, 8 figures.

Figures (8)

  • Figure 1: Left: Distribution of post-activation $\phi(z)$, with $z\sim {\cal N}(0, \sigma_z^2)$. For small $\sigma_z^2<1/2$, the distribution is peaked around zero, while for large $\sigma_z^2\gg 1/2$, the distribution is sharply peaked towards $\pm 1$. Right: Hyperbolic tangent function $\phi(z)=\tanh(z)$ as activation function, with a linear regime ($|z|\lesssim 1$) and a jamming or vanishing gradient regime ($|z| \gg 1$).
  • Figure 2: The mean test loss of the trained models in the $T-1/\sigma_W$ plane for the hyperbolic tangent teacher-student network. A darker colour indicates a lower value.
  • Figure 3: Mean loss (left) and gradient (right) at the end of training as a function of $\epsilon/|{\cal B}|$ for various values of $1/\sigma_W$. Error bars are omitted for visibility.
  • Figure 5: The correlation between features at initial and final time, $G(t_f,0)=\mathbb{E}[\phi(t_f) \phi(0)]$ (left), and the correlation between the features $\phi$ and the external magnetic field $h$, $\mathbb{E}[h \phi]$, i.e. the alignment between model and target features (right), in the $T-1/\sigma_W$ plane for the hyperbolic tangent teacher-student network. A darker colour indicates weaker correlation.
  • Figure 6: Evolution of the eigenvalues of $X = W^{(2)}W^{(2)\,T}$ during training for ensembles of teacher-student models at three choices of hyperparameters, in the ordered phase (left) with $\epsilon/|{\cal B}|=2^{-5}$, $1/\sigma_W=8$, in the high-temperature phase (centre) with $\epsilon/|{\cal B}|=4$, $1/\sigma_W=8$, and in the disordered phase (right) with $\epsilon/|{\cal B}|=2^{-5}$, $1/\sigma_W=2^{-8}$. Dashed horizontal lines denote the target eigenvalues and are only visible on the left. Note the difference in scale on the vertical axis.
  • ...and 3 more figures
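
The threshold $\sigma_z^2 = 1/2$ quoted in the caption of Figure 1 can be checked with a short change-of-variables calculation. For $\phi = \tanh(z)$ with $z \sim {\cal N}(0, \sigma_z^2)$, the density is $p(\phi) = {\cal N}(\operatorname{arctanh}\phi;\, 0, \sigma_z^2)/(1-\phi^2)$, and expanding near $\phi = 0$ gives $\log p(\phi) \approx \text{const} + \phi^2\,(1 - 1/(2\sigma_z^2))$, so the point $\phi = 0$ turns from a maximum into a minimum at $\sigma_z^2 = 1/2$. The sketch below is a numerical check of this statement; the function names and the test values of $\sigma_z^2$ are chosen for illustration and are not taken from the paper.

```python
import numpy as np


def tanh_output_density(phi, sigma_z2):
    """Density of phi = tanh(z) for z ~ N(0, sigma_z2), valid for |phi| < 1."""
    phi = np.asarray(phi, dtype=float)
    z = np.arctanh(phi)
    gauss = np.exp(-z ** 2 / (2.0 * sigma_z2)) / np.sqrt(2.0 * np.pi * sigma_z2)
    # Jacobian of the change of variables: |dz/dphi| = 1 / (1 - phi^2)
    return gauss / (1.0 - phi ** 2)


def curvature_sign_at_zero(sigma_z2, h=1e-3):
    """Sign of the finite-difference second derivative of the density at phi = 0."""
    p = tanh_output_density(np.array([-h, 0.0, h]), sigma_z2)
    return np.sign(p[0] - 2.0 * p[1] + p[2])


# Illustrative variances on either side of the threshold 1/2.
for sigma_z2 in (0.25, 2.0):
    sign = curvature_sign_at_zero(sigma_z2)
    shape = "peaked at 0" if sign < 0 else "minimum at 0 (mass pushed towards +/-1)"
    print(f"sigma_z^2 = {sigma_z2}: {shape}")
```

A negative curvature at zero corresponds to the regime where most pre-activations sit in the linear part of $\tanh$, while a positive curvature signals that the distribution has started to pile up near $\pm 1$, where gradients vanish.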