Universal characteristics of deep neural network loss surfaces from random matrix theory

Nicholas P Baskerville; Jonathan P Keating; Francesco Mezzadri; Joseph Najnudel; Diego Granziol

TL;DR

This work develops a general random-matrix framework for deep neural network Hessians, treating the batch Hessian as $H = \mathsf{s}(b) X + A$ with QUE-delocalised noise and a finite-rank spike structure. It proves that, under QUE and concentration, the spectrum of $H$ converges to the free convolution $\mu_X \boxplus \mu_D$, and provides explicit outlier locations via a subordination function, offering concrete predictions for Hessian outliers across batch sizes and architectures. Experimental validation with Lanczos-based outlier extraction on CIFAR-100 and MNIST demonstrates strong agreement with the theory for certain models (notably ResNet), supporting the presence of universal local random-matrix statistics in real DNN Hessians. The paper also connects these spectral insights to optimization, showing that local laws can drastically simplify preconditioned SGD dynamics and offering a general perspective on the prevalence of minima and the rough/smooth dichotomy of loss surfaces.
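As a rough illustration of the additive model $H = \mathsf{s}(b) X + A$, the sketch below takes the simplest special case: $X$ is a GOE matrix and $A = \theta \bm{v}\bm{v}^T$ is a single rank-one spike. These choices, the fixed scalar `s` standing in for $\mathsf{s}(b)$, and the helper names `sample_goe` and `predicted_outlier` are assumptions made for illustration and are not taken from the paper. In this special case the bulk follows a semicircle on $[-2s, 2s]$ and the classical result for spiked Wigner matrices places the outlier at $\theta + s^2/\theta$ whenever $\theta > s$. The paper's general result, with a non-trivial $\mu_D$ and a subordination function, is not reproduced here.

```python
import numpy as np


def sample_goe(n, rng):
    """Sample an n x n GOE matrix normalised so its bulk is the semicircle on [-2, 2]."""
    g = rng.standard_normal((n, n))
    return (g + g.T) / np.sqrt(2.0 * n)


def predicted_outlier(theta, s):
    """Outlier of s * GOE + theta * v v^T (classical spiked-Wigner formula).

    Returns None when theta <= s, where no eigenvalue separates from the bulk.
    """
    return theta + s**2 / theta if theta > s else None


rng = np.random.default_rng(0)
n, s, theta = 800, 0.5, 2.0  # s is a hypothetical stand-in for s(b)

# Random unit vector carrying the rank-one spike (illustrative choice of A).
v = rng.standard_normal(n)
v /= np.linalg.norm(v)

H = s * sample_goe(n, rng) + theta * np.outer(v, v)
eigs = np.linalg.eigvalsh(H)  # ascending order

top, second = eigs[-1], eigs[-2]
print("largest eigenvalue:  ", top)
print("predicted outlier:   ", predicted_outlier(theta, s))
print("second eigenvalue:   ", second, "(bulk edge is 2*s =", 2 * s, ")")
```

Shrinking `s` moves the outlier towards $\theta$ and narrows the bulk. This is the qualitative mechanism behind a batch-size dependence of outliers, although the actual form of $\mathsf{s}(b)$ is not specified in this summary.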

Abstract

This paper considers several aspects of random matrix universality in deep neural networks. Motivated by recent experimental work, we use universal properties of random matrices related to local statistics to derive practical implications for deep neural networks based on a realistic model of their Hessians. In particular we derive universal aspects of outliers in the spectra of deep neural networks and demonstrate the important role of random matrix local laws in popular pre-conditioning gradient descent algorithms. We also present insights into deep neural network loss surfaces from quite general arguments based on tools from statistical physics and random matrix theory.

Paper Structure (26 sections, 9 theorems, 146 equations, 5 figures)

Introduction
Notation
General random matrix model for loss surface Hessians
The model
Quantum unique ergodicity
Batch Hessian outliers
An interlude on prior outlier results
Experimental results
Justification and motivation of QUE
Motivation of true Hessian structure
The batch size scaling
Spectral free addition from QUE
Intermediate results on QUE
Main result
Experimental validation
...and 11 more sections

Key Result

Lemma 3.1

Consider a real orthogonal $N\times N$ matrix $U$ with rows $\{\bm{u}_i^T\}_{i=1}^N$. Assume that $\{\bm{u}_i\}_{i=1}^N$ are the eigenvectors of a real random symmetric matrix with QUE. Let $P$ be a fixed $N\times N$ real orthogonal matrix. Let $V = UP$ and denote the rows of $V$ by $\{\bm{v}_i^T\}_

Figures (5)

Figure 1: The batch-size scaling of the outliers in the spectra of the Hessians of the ResNet loss on CIFAR100. Training epochs increase top-to-bottom from initialisation to the final trained model. Left-to-right the outlier index varies (outlier 1 being the largest). Red crosses show results from Lanczos approximations over 10 samples (different batches) for each batch size. The blue lines are parametric power law fits of the form (\ref{eq:omega_fit_form}).
Figure 2: Left-to-right the outlier index varies (outlier 1 being the largest). Red crosses show results from Lanczos approximations over 10 samples (different batches) for each batch size. The blue lines are parametric power law fits of the form (\ref{eq:omega_fit_form}). This plot shows the final epoch (300) for the VGG16 on CIFAR100 and the first epoch for the MLP on MNIST, both being examples of the parametric fit failing to match the data.
Figure 3: The blue lines are parametric power law fits of the form (\ref{eq:omega_fit_form}).
Figure 4: Comparison of the theoretical spectral density with the empirical density from sampled matrices, all of size $500\times 500$. We combine $50$ independent matrix samples per plot.
Figure 5: q-q plot comparing the spectrum of samples from $Reg^{N,d} + UWig^N$ ($y$-axis) to samples from $Reg^{N,d} + GOE^N$ ($x$-axis).
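
The captions of Figures 1 and 2 refer to Lanczos approximations of the Hessian outliers. The sketch below shows one way such an extraction could look, assuming access only to a matrix-vector product. It is a plain NumPy Lanczos iteration with full reorthogonalisation, not the authors' implementation; the function name `lanczos_top_eigenvalues`, the step count, and the synthetic test matrix are all hypothetical choices for illustration.

```python
import numpy as np


def lanczos_top_eigenvalues(matvec, dim, num_steps=80, k=3, seed=0):
    """Approximate the k largest eigenvalues of a symmetric operator.

    matvec: function mapping a length-dim vector v to H @ v.
    Uses full reorthogonalisation, which costs O(num_steps * dim) memory.
    """
    rng = np.random.default_rng(seed)
    Q = np.zeros((num_steps, dim))  # Lanczos vectors stored as rows
    alphas = np.zeros(num_steps)
    betas = np.zeros(num_steps - 1)

    q = rng.standard_normal(dim)
    Q[0] = q / np.linalg.norm(q)
    m = num_steps  # number of steps actually completed
    for j in range(num_steps):
        w = matvec(Q[j])
        alphas[j] = Q[j] @ w
        # Remove components along all Lanczos vectors built so far.
        w = w - Q[: j + 1].T @ (Q[: j + 1] @ w)
        if j == num_steps - 1:
            break
        beta = np.linalg.norm(w)
        if beta < 1e-12:  # invariant subspace found, stop early
            m = j + 1
            break
        betas[j] = beta
        Q[j + 1] = w / beta

    # Eigenvalues of the tridiagonal matrix (Ritz values) approximate those of H.
    T = np.diag(alphas[:m]) + np.diag(betas[: m - 1], 1) + np.diag(betas[: m - 1], -1)
    return np.linalg.eigvalsh(T)[::-1][:k]


# Synthetic stand-in for a Hessian: scaled GOE noise plus three planted spikes.
rng = np.random.default_rng(1)
n = 400
g = rng.standard_normal((n, n))
noise = 0.5 * (g + g.T) / np.sqrt(2.0 * n)
basis, _ = np.linalg.qr(rng.standard_normal((n, 3)))  # orthonormal spike directions
H = noise + basis @ np.diag([5.0, 4.0, 3.0]) @ basis.T

approx = lanczos_top_eigenvalues(lambda x: H @ x, n)
exact = np.linalg.eigvalsh(H)[::-1][:3]
print("Lanczos:", approx)
print("exact:  ", exact)
```

For a real network the dense `H @ x` would be replaced by a Hessian-vector product computed with automatic differentiation on a batch, so the full matrix is never formed.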

Theorems & Definitions (23)

Remark 2.1
Remark 2.2
Remark 2.4
Lemma 3.1
proof
Lemma 3.2
proof
Lemma 3.3
proof
Theorem 3.4
...and 13 more
