Spectral alignment of stochastic gradient descent for high-dimensional classification tasks
Gerard Ben Arous, Reza Gheissari, Jiaoyang Huang, Aukosh Jagannath
TL;DR
This work establishes a rigorous link between SGD dynamics and the spectral geometry of the empirical Hessian and G-matrices in high-dimensional classification. The authors analyze two canonical models: single-layer linear classification of a GMM, and a two-layer network trained on an XOR-type GMM. They prove that, shortly after training begins, the SGD trajectory and the outlier eigenspaces of the Hessian and G-matrices align with a common low-dimensional subspace determined by the data means. In multi-layer networks the alignment holds layer by layer, and the outlier eigenspaces can be rank-deficient when SGD converges to suboptimal classifiers. The proofs rest on three ingredients: decompositions of the population matrices into a low-rank outlier part plus a small bulk, ballistic limits of the SGD dynamics for summary statistics, and uniform concentration of the empirical matrices. The results are then extended to matrices built from the training data. Overall, the paper gives a principled, quantitative account of spectral predictions from numerical studies, connecting them to provable dynamical and algebraic structures in overparameterized classification tasks.
Abstract
We rigorously study the relation between the training dynamics via stochastic gradient descent (SGD) and the spectra of empirical Hessian and gradient matrices. We prove that in two canonical classification tasks for multi-class high-dimensional mixtures and either 1 or 2-layer neural networks, both the SGD trajectory and emergent outlier eigenspaces of the Hessian and gradient matrices align with a common low-dimensional subspace. Moreover, in multi-layer settings this alignment occurs per layer, with the final layer's outlier eigenspace evolving over the course of training, and exhibiting rank deficiency when the SGD converges to sub-optimal classifiers. This establishes some of the rich predictions that have arisen from extensive numerical studies in the last decade about the spectra of Hessian and information matrices over the course of training in overparametrized networks.
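
The alignment described above can be illustrated numerically. The following is a minimal sketch under simplifying assumptions, not the paper's setup: it uses a binary mixture with means +mu and -mu, a single-layer logistic classifier, and online SGD with step size 1/d. All function names and parameter values (`sample_gmm`, `alignment_demo`, `sigma`, and so on) are hypothetical choices made for illustration. It measures how well the SGD iterate and the top eigenvectors of the empirical Hessian and G-matrix align with the mean direction.

```python
import numpy as np

def sample_gmm(n, mu, sigma, rng):
    """Draw n points from a balanced two-component mixture with means +mu and -mu."""
    y = rng.integers(0, 2, size=n)            # labels in {0, 1}
    signs = 2.0 * y - 1.0                     # map labels to {-1, +1}
    x = signs[:, None] * mu + sigma * rng.standard_normal((n, mu.size))
    return x, y

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def top_eigvec(mat):
    """Unit eigenvector of the largest eigenvalue of a symmetric matrix."""
    _, vecs = np.linalg.eigh(mat)             # eigenvalues returned in ascending order
    return vecs[:, -1]

def alignment_demo(d=100, sigma=0.3, n_steps=1000, n_eval=4000, seed=0):
    rng = np.random.default_rng(seed)
    mu = rng.standard_normal(d)
    mu /= np.linalg.norm(mu)                  # unit-norm class mean (assumed scaling)

    # Online SGD on the logistic loss: one fresh sample per step.
    w = np.zeros(d)
    lr = 1.0 / d                              # step size of order 1/d (assumed constant)
    for _ in range(n_steps):
        x, y = sample_gmm(1, mu, sigma, rng)
        grad = (sigmoid(x[0] @ w) - y[0]) * x[0]
        w -= lr * grad

    # Empirical Hessian and G-matrix (second moment of gradients) on fresh data.
    x, y = sample_gmm(n_eval, mu, sigma, rng)
    p = sigmoid(x @ w)
    hessian = (x * (p * (1.0 - p))[:, None]).T @ x / n_eval
    g_matrix = (x * ((p - y) ** 2)[:, None]).T @ x / n_eval

    # Absolute cosine of the angle between each vector and the mean direction.
    return {
        "sgd": abs(w @ mu) / np.linalg.norm(w),
        "hessian": abs(top_eigvec(hessian) @ mu),
        "g_matrix": abs(top_eigvec(g_matrix) @ mu),
    }

if __name__ == "__main__":
    for name, value in alignment_demo().items():
        print(f"{name}: alignment with mean direction = {value:.3f}")
```

In this toy setting the common subspace is one-dimensional (the span of mu), so a single cosine per object suffices. The multi-class and XOR settings studied in the paper would require comparing against a higher-dimensional subspace, for example by projecting onto the span of all class means.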
