Spectral alignment of stochastic gradient descent for high-dimensional classification tasks

Gerard Ben Arous; Reza Gheissari; Jiaoyang Huang; Aukosh Jagannath

TL;DR

This work establishes a rigorous link between SGD dynamics and the spectral geometry of the empirical Hessian and G-matrices in high-dimensional classification. By analyzing two canonical models, a single-layer linear classifier for a Gaussian mixture model (GMM) and a two-layer network for an XOR-type GMM, the authors prove that, shortly after training begins, the SGD trajectory and the outlier eigenspaces of the Hessian and G-matrices align with a common low-dimensional subspace determined by the data means. In multi-layer networks the alignment holds layer by layer, and the outlier eigenspaces can be rank-deficient when SGD converges to a suboptimal classifier. The proofs rest on three ingredients: decompositions of the population matrices into low-rank outliers plus a small bulk, ballistic-limit SGD dynamics for summary statistics, and uniform concentration of the empirical matrices; the results are then extended to matrices built from the training data. Overall, the paper gives a quantitative, provable account of spectral phenomena previously observed in numerical studies of overparameterized classification.
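
As a worked illustration of the low-rank-plus-bulk structure mentioned above, consider the second moment of the data, under the assumption (not stated in this summary) that $Y$ is drawn from a mixture $\sum_{a\in[k]} p_a\,\mathcal{N}(\mu_a, I_d/\lambda)$ with weights $p_a$:

$$\mathbb{E}[YY^\top] = \sum_{a\in[k]} p_a\,\mu_a\mu_a^\top + \frac{1}{\lambda} I_d.$$

The first term has rank at most $k$ and range in $\mathrm{span}(\mu_1,\ldots,\mu_k)$; the second is a multiple of the identity of size $1/\lambda$. For the unregularized softmax cross-entropy loss, the $cc$-block of the Hessian at a sample $Y$ is $\pi_c(1-\pi_c)\,YY^\top$, where $\pi_c$ denotes the predicted probability of class $c$ (notation introduced here). At $\mathbf{x}=0$, where $\pi_c = 1/k$, the population block is therefore $\tfrac{1}{k}\bigl(1-\tfrac{1}{k}\bigr)\,\mathbb{E}[YY^\top]$ and inherits this structure. This is only a sanity check at one point in parameter space, not the paper's general decomposition.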

Abstract

We rigorously study the relation between the training dynamics via stochastic gradient descent (SGD) and the spectra of empirical Hessian and gradient matrices. We prove that in two canonical classification tasks for multi-class high-dimensional mixtures and either 1 or 2-layer neural networks, both the SGD trajectory and emergent outlier eigenspaces of the Hessian and gradient matrices align with a common low-dimensional subspace. Moreover, in multi-layer settings this alignment occurs per layer, with the final layer's outlier eigenspace evolving over the course of training, and exhibiting rank deficiency when the SGD converges to sub-optimal classifiers. This establishes some of the rich predictions that have arisen from extensive numerical studies in the last decade about the spectra of Hessian and information matrices over the course of training in overparametrized networks.
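
The alignment of the SGD trajectory with the span of the class means can be probed numerically. The following is a minimal sketch, not the authors' code. It assumes data $Y \sim \mathcal{N}(\mu_y, I_d/\lambda)$ with a uniformly random class $y$, orthonormal means $\mu_c = e_c$, a one-layer softmax classifier with rows $\mathbf{x}^c$, online SGD with step $\delta = 1/d$, and an $\ell_2$ penalty $\frac{\beta}{2}\|\mathbf{x}\|^2$. The function names (`run_sgd`, `span_alignment`) and the sizes $d=200$, $k=4$ are illustrative choices.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax of a 1-D array of logits.
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def span_alignment(x, means):
    """For each row of x, the norm of its projection onto span(means),
    divided by the norm of the row. Values lie in [0, 1]."""
    q, _ = np.linalg.qr(means.T)      # d x k orthonormal basis of span(means)
    proj = x @ q                      # coordinates of each row in that basis
    return np.linalg.norm(proj, axis=1) / np.linalg.norm(x, axis=1)

def run_sgd(d=200, k=4, lam=10.0, beta=0.01, T=50.0, seed=0):
    rng = np.random.default_rng(seed)
    means = np.eye(k, d)              # assumption: orthonormal means mu_c = e_c
    delta = 1.0 / d                   # learning rate delta = 1/d
    x = rng.normal(scale=d ** -0.5, size=(k, d))   # rows x^c ~ N(0, I_d/d)
    init = span_alignment(x, means)
    n_steps = int(round(T * d))       # rescaled time T = n_steps * delta
    for _ in range(n_steps):
        y = rng.integers(k)           # fresh sample at every step (online SGD)
        sample = means[y] + rng.normal(size=d) / np.sqrt(lam)
        p = softmax(x @ sample)
        p[y] -= 1.0                   # p_c - 1{y = c}
        grad = np.outer(p, sample) + beta * x      # cross-entropy + L2 penalty
        x -= delta * grad
    return init, span_alignment(x, means)

init, final = run_sgd()
print("initial alignment per class:", np.round(init, 2))
print("final alignment per class:  ", np.round(final, 2))
```

At initialization each row is a random direction, so only a small share of its norm lies in the $k$-dimensional span of the means; after training most of the norm should lie there. The residual orthogonal component comes from the initialization, which decays only through $\beta$, and from accumulated sample noise of order $\lambda^{-1}$.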

Paper Structure (43 sections, 35 theorems, 240 equations, 12 figures)

Introduction
Our contributions
Main Results
Classifying linearly separable mixture models
Data model
Results and discussion
Classifying XOR-type mixture models via two-layer networks
Data model
Results and discussion
Outline and ideas of proof
Global notation
Analysis of the population matrices: 1-layer networks
Preliminary calculations and notation
Analysis of the population Hessian matrix
On-diagonal blocks
...and 28 more sections

Key Result

Theorem 2.3

Consider the mixture of $k$-Gaussians with loss function from (eq:cross-entropy-loss), and SGD (eq:SGD-def) with learning rate $\delta = O(1/d)$, regularizer $\beta>0$, initialized from $\mathcal{N}(0,I_d/d)$. There exist $\alpha_0, \lambda_0$ such that if $\lambda\ge \lambda_0$, and $\widetilde{M}$ [...] for every $c\in [k]$, up to $O(\varepsilon + \lambda^{-1})$ error, for all $\ell \in [T_0\delta^{-1}, \ldots]$ [...]

Figures (12)

Figure 2.1: The alignment of the SGD trajectory $\mathbf{x}_\ell^c$ with $E_k(\nabla^2_{cc}\widehat{R}(\mathbf{x}_\ell))$ (left) and $E_k(\widehat{G}_{cc}(\mathbf{x}_\ell))$ (right), for $c\in [k]$ (shown in different colors). The $x$-axis is rescaled time, $\ell \delta$. The parameters are $k=10$ classes in dimension $d=1000$ with $\lambda=10$, $\beta = 0.01$, and $\delta = 1/d$.
Figure 2.2: From left to right: Plot of entries of $\mathbf{x}_\ell^1$ and the $k$ leading eigenvectors (in different colors) of $\nabla_{11}^2\widehat{R}(\mathbf{x}_\ell)$ and $\widehat{G}_{11}(\mathbf{x}_\ell)$ respectively at the end of training, namely $\ell = 50\cdot d=25,000$ steps. Here the $x$-axis represents the coordinate index. The parameters are the same as in Fig. 2.1 and the means are $\mu_i = e_{i*50}$.
Figure 2.3: Left: the eigenvalues (in different colors) of $\nabla^2 \widehat{R}_{11}(\mathbf{x}_\ell)$ over the course of training. The leading $k$ eigenvalues are separated from the bulk at all times, and the top eigenvalue, corresponding to $\mu_1$, separates from the remaining eigenvalues soon after initialization. Right: the inner products of $\mathbf{x}_\ell^1$ with the means $\mu_1,\ldots,\mu_k$ undergo a similar separation over the course of training. Parameters are the same as in the preceding figures.
Figure 2.4: (a) and (b) depict the alignment of the first-layer weights $W_i(\mathbf{x}_\ell)$ for $i=1,\ldots,K$ (in different colors) with the principal subspaces of the corresponding blocks of the Hessian and G-matrices, i.e., with $E_2(\nabla^2_{W_i W_i} \widehat{R}(\mathbf{x}_\ell))$ and $E_2(\widehat{G}_{W_i W_i}(\mathbf{x}_\ell))$. (c) and (d) plot the second-layer alignment, namely of $v(\mathbf{x}_\ell)$ with $E_4(\nabla^2_{vv} \widehat{R}(\mathbf{x}_\ell))$ and $E_4(\widehat{G}_{vv}(\mathbf{x}_\ell))$. Parameters are $d=1000$, $\lambda=10$, and $K=20$.
Figure 2.5: The eigenvalues (in different colors) of the $vv$ blocks of the Hessian and G-matrices over time from a random initialization. Initially, there is one outlier eigenvalue due to the positivity of the ReLU activation. Along training, four outlier eigenvalues separate from the bulk, corresponding to the four "hidden" classes in the XOR problem. Parameters are the same as in Figure 2.4.
...and 7 more figures
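
The captions above measure alignment with $E_k(\cdot)$, the span of the $k$ leading eigenvectors of a Hessian or G-matrix block. The sketch below shows one way such a quantity could be computed for the one-layer softmax model. It is not the authors' code and makes several assumptions: data $Y \sim \mathcal{N}(\mu_y, I_d/\lambda)$ with orthonormal means $\mu_c = e_c$, the regularizer omitted from both matrices, and the blocks evaluated at a random initialization rather than along a trajectory. The helper names (`top_eigenspace`, `alignment`, `empirical_blocks`) and the sizes are hypothetical.

```python
import numpy as np

def softmax_rows(z):
    # Row-wise numerically stable softmax.
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def top_eigenspace(mat, r):
    """Orthonormal basis (as columns) of the span of the r leading
    eigenvectors of a symmetric matrix."""
    _, vecs = np.linalg.eigh(mat)     # eigh returns eigenvalues in ascending order
    return vecs[:, -r:]

def alignment(v, basis):
    """Norm of the projection of v onto span(basis), relative to the norm of v."""
    return np.linalg.norm(basis.T @ v) / np.linalg.norm(v)

def empirical_blocks(x, samples, labels, c):
    """cc-blocks of the empirical Hessian and G-matrix of the unregularized
    softmax cross-entropy loss, averaged over the given samples."""
    p = softmax_rows(samples @ x.T)[:, c]     # predicted probability of class c
    resid = p - (labels == c)                 # derivative of loss w.r.t. logit c
    n = len(samples)
    hess = (samples * (p * (1 - p))[:, None]).T @ samples / n
    gmat = (samples * (resid ** 2)[:, None]).T @ samples / n
    return hess, gmat

rng = np.random.default_rng(1)
d, k, lam, n = 100, 3, 50.0, 4000             # illustrative sizes
means = np.eye(k, d)                          # assumption: mu_c = e_c
labels = rng.integers(k, size=n)
samples = means[labels] + rng.normal(size=(n, d)) / np.sqrt(lam)
x = rng.normal(scale=d ** -0.5, size=(k, d))  # random initialization

hess, gmat = empirical_blocks(x, samples, labels, c=0)
hess_align = [alignment(mu, top_eigenspace(hess, k)) for mu in means]
g_align = [alignment(mu, top_eigenspace(gmat, k)) for mu in means]
print("means vs E_k(Hessian block):", np.round(hess_align, 3))
print("means vs E_k(G block):      ", np.round(g_align, 3))
```

Here the means are tested against the outlier eigenspaces; the same `alignment` helper applies to a row $\mathbf{x}_\ell^c$ of an SGD iterate in place of `mu`, which is the quantity the figures plot over rescaled time $\ell\delta$.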

Theorems & Definitions (75)

Definition 2.1
Definition 2.2
Theorem 2.3
Theorem 2.4
Theorem 2.5
Theorem 2.6
Remark 1
Remark 2
Theorem 2.7
Remark 2.8
...and 65 more
