Spectral alignment of stochastic gradient descent for high-dimensional classification tasks

Gerard Ben Arous, Reza Gheissari, Jiaoyang Huang, Aukosh Jagannath

TL;DR

This work establishes a rigorous link between SGD dynamics and the spectral geometry of the empirical Hessian and G-matrices in high-dimensional classification. Analyzing two canonical models, a single-layer linear classifier for a Gaussian mixture model (GMM) and a two-layer network for the XOR-GMM, the authors prove that, shortly after training begins, the SGD trajectory and the outlier eigenspaces of the Hessian and G-matrices align with a common low-dimensional subspace determined by the data means. In multi-layer networks the alignment holds layer by layer, and the outlier eigenspaces can be rank-deficient when SGD converges to a suboptimal classifier. The proofs rest on three ingredients: decompositions of the population matrices into a low-rank outlier part plus a small bulk, ballistic-limit SGD dynamics for summary statistics, and uniform concentration of the empirical matrices; the results are then extended to train-data matrices. Overall, the paper gives a quantitative account of spectral predictions from earlier numerical studies, connecting them to provable dynamical and algebraic structure in overparameterized classification tasks.

Abstract

We rigorously study the relation between the training dynamics via stochastic gradient descent (SGD) and the spectra of empirical Hessian and gradient matrices. We prove that in two canonical classification tasks for multi-class high-dimensional mixtures and either 1 or 2-layer neural networks, both the SGD trajectory and emergent outlier eigenspaces of the Hessian and gradient matrices align with a common low-dimensional subspace. Moreover, in multi-layer settings this alignment occurs per layer, with the final layer's outlier eigenspace evolving over the course of training, and exhibiting rank deficiency when the SGD converges to sub-optimal classifiers. This establishes some of the rich predictions that have arisen from extensive numerical studies in the last decade about the spectra of Hessian and information matrices over the course of training in overparametrized networks.
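
The alignment described in the abstract can be probed numerically on a small Gaussian mixture. The following is a minimal sketch, not the authors' code: it assumes means $\mu_i = e_i$, noise $\mathcal{N}(0, I_d/\lambda)$, a softmax cross-entropy loss with an $\ell^2$ penalty $\tfrac{\beta}{2}\|\mathbf{x}\|^2$, and a G-matrix built from gradients of the unregularized loss. The helper names (`online_sgd`, `diagonal_blocks`, `alignment`) and the reduced sizes ($d=200$, $k=3$) are illustrative choices. It measures how much of each class vector $\mathbf{x}^c$ lies in the top-$k$ eigenspace of the $cc$ blocks of the empirical Hessian and G-matrix, before and after training.

```python
import numpy as np

def softmax(z):
    """Softmax along the last axis, shifted for numerical stability."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def sample(rng, n, means, lam):
    """Draw n labelled points Y = mu_y + N(0, I_d / lam) (assumed data model)."""
    k, d = means.shape
    y = rng.integers(k, size=n)
    Y = means[y] + rng.standard_normal((n, d)) / np.sqrt(lam)
    return y, Y

def online_sgd(rng, x, means, lam, beta, steps):
    """One fresh sample per step, step size delta = 1/d, l2 penalty beta."""
    k, d = means.shape
    delta = 1.0 / d
    x = x.copy()
    for _ in range(steps):
        y, Y = sample(rng, 1, means, lam)
        resid = softmax(x @ Y[0])   # class probabilities, shape (k,)
        resid[y[0]] -= 1.0          # p - one_hot(y)
        x -= delta * (np.outer(resid, Y[0]) + beta * x)
    return x

def diagonal_blocks(x, y, Y, c, beta):
    """cc blocks of the empirical Hessian and G-matrix on the sample (y, Y)."""
    n, d = Y.shape
    pc = softmax(Y @ x.T)[:, c]
    hess_w = pc * (1.0 - pc)                      # d^2 loss / d(logit_c)^2
    grad_w = (pc - (y == c).astype(float)) ** 2   # (d loss / d(logit_c))^2
    H = (Y * hess_w[:, None]).T @ Y / n + beta * np.eye(d)
    G = (Y * grad_w[:, None]).T @ Y / n
    return H, G

def alignment(v, A, k):
    """Norm of the projection of v/|v| onto the top-k eigenspace of symmetric A."""
    _, vecs = np.linalg.eigh(A)   # eigenvalues are returned in ascending order
    top = vecs[:, -k:]
    return np.linalg.norm(top.T @ v) / np.linalg.norm(v)

def report(x, y, Y, k, beta):
    """Rows: classes c. Columns: (Hessian alignment, G-matrix alignment)."""
    rows = []
    for c in range(k):
        H, G = diagonal_blocks(x, y, Y, c, beta)
        rows.append((alignment(x[c], H, k), alignment(x[c], G, k)))
    return np.array(rows)

rng = np.random.default_rng(0)
d, k, lam, beta = 200, 3, 10.0, 0.01   # illustrative sizes, smaller than the figures
means = np.eye(k, d)                   # assumed means mu_i = e_i
x_init = rng.standard_normal((k, d)) / np.sqrt(d)
x_trained = online_sgd(rng, x_init, means, lam, beta, steps=20 * d)
y_eval, Y_eval = sample(rng, 20 * d, means, lam)   # fresh data for the matrices

align_init = report(x_init, y_eval, Y_eval, k, beta)
align_trained = report(x_trained, y_eval, Y_eval, k, beta)
print("alignment at initialization:\n", align_init)
print("alignment after training:\n", align_trained)
```

Under these assumptions, the alignment of a random initialization should be small (on the order of $\sqrt{k/d}$), and it should increase substantially after training. The exact values depend on the seed and on the chosen $d$, $\lambda$, and $\beta$.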

Paper Structure (43 sections, 35 theorems, 240 equations, 12 figures)

This paper contains 43 sections, 35 theorems, 240 equations, 12 figures.

Key Result

Theorem 2.3

Consider the mixture of $k$-Gaussians with loss function from (eq:cross-entropy-loss), and SGD (eq:SGD-def) with learning rate $\delta = O(1/d)$, regularizer $\beta>0$, initialized from $\mathcal{N}(0,I_d/d)$. There exists $\alpha_0, \lambda_0$ such that if $\lambda\ge \lambda_0$, and $\widetilde{M} for every $c\in [k]$, up to $O(\varepsilon + \lambda^{-1})$ error, for all $\ell \in [T_0\delta^{-1

Figures (12)

  • Figure 2.1: The alignment of the SGD trajectory $\mathbf{x}_\ell^c$ with $E_k(\nabla^2_{cc}\widehat{R}(\mathbf{x}_\ell))$ (left) and $E_k(\widehat{G}_{cc}(\mathbf{x}_\ell))$ (right), for $c\in [k]$ (shown in different colors). The $x$-axis is rescaled time, $\ell \delta$. The parameters are $k=10$ classes in dimension $d=1000$ with $\lambda=10$, $\beta = 0.01$, and $\delta = 1/d$.
  • Figure 2.2: From left to right: Plot of entries of $\mathbf{x}_\ell^1$ and the $k$ leading eigenvectors (in different colors) of $\nabla_{11}^2\widehat{R}(\mathbf{x}_\ell)$ and $\widehat{G}_{11}(\mathbf{x}_\ell)$ respectively at the end of training, namely $\ell = 50\cdot d=25,000$ steps. Here the $x$-axis represents the coordinate index. The parameters are the same as in Figure 2.1 and the means are $\mu_i = e_{i*50}$.
  • Figure 2.3: Left: the eigenvalues (in different colors) of $\nabla^2_{11} \widehat{R}(\mathbf{x}_\ell)$ over the course of training. The leading $k$ eigenvalues are separated from the bulk at all times, and the top eigenvalue, corresponding to $\mu_1$, separates from the remaining eigenvalues soon after initialization. Right: the inner products of $\mathbf{x}_\ell^1$ with the means $\mu_1,\ldots,\mu_k$ undergo a similar separation over the course of training. Parameters are the same as in the preceding figures.
  • Figure 2.4: (a) and (b) depict the alignment of the first-layer weights $W_i(\mathbf{x}_\ell)$ for $i=1,\ldots,K$ (in different colors) with the principal subspaces of the corresponding blocks of the Hessian and G-matrices, i.e., with $E_2(\nabla^2_{W_i W_i} \widehat{R}(\mathbf{x}_\ell))$ and $E_2(\widehat{G}_{W_i W_i}(\mathbf{x}_\ell))$. (c) and (d) plot the second-layer alignment, namely of $v(\mathbf{x}_\ell)$ with $E_4(\nabla^2_{vv} \widehat{R}(\mathbf{x}_\ell))$ and $E_4(\widehat{G}_{vv}(\mathbf{x}_\ell))$. Parameters are $d=1000$, $\lambda=10$, and $K=20$.
  • Figure 2.5: The eigenvalues (in different colors) of the $vv$ blocks of the Hessian and G-matrices over time from a random initialization. Initially, there is one outlier eigenvalue due to the positivity of the ReLU activation. Along training, four outlier eigenvalues separate from the bulk, corresponding to the four "hidden" classes in the XOR problem. Parameters are the same as in Figure 2.4.
  • ...and 7 more figures

Theorems & Definitions (75)

  • Definition 2.1
  • Definition 2.2
  • Theorem 2.3
  • Theorem 2.4
  • Theorem 2.5
  • Theorem 2.6
  • Remark 1
  • Remark 2
  • Theorem 2.7
  • Remark 2.8
  • ...and 65 more