Local geometry of high-dimensional mixture models: Effective spectral theory and dynamical transitions

Gerard Ben Arous, Reza Gheissari, Jiaoyang Huang, Aukosh Jagannath

TL;DR

The paper develops a precise, dimension-free spectral theory for self-coupled empirical matrices arising in high-dimensional loss landscapes, showing that the bulk spectrum and outliers depend only on a low-dimensional Gram summary $\mathbf{G}=(\mathbf{x},\boldsymbol{\mu})^T(\mathbf{x},\boldsymbol{\mu})$. It then connects SGD dynamics to an autonomous evolution of these summaries, enabling tracking of spectral transitions along training trajectories, with explicit results for high-dimensional logistic regression on Gaussian mixtures. The framework encompasses a broad class of problems, including multi-layer GMM classification and multi-index regression, and yields both static initialization results and dynamical outlier-splitting along SGD, providing sharp BBP-type thresholds and deterministic equivalents. The work offers a powerful tool for understanding when informative spectral directions emerge during learning, and how the interplay between bulk and outliers shapes optimization in high dimensions, with potential implications for understanding generalization and training dynamics in structured models.
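
As a minimal sketch of the low-dimensional summary mentioned above, the snippet below forms $\mathbf{G}=(\mathbf{x},\boldsymbol{\mu})^T(\mathbf{x},\boldsymbol{\mu})$ with NumPy and checks that it is unchanged under a rotation of the ambient space. The helper name `gram_summary`, the sizes `d`, `C`, `k`, and the choice of orthonormal means are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def gram_summary(x, mu):
    """Return G = (x, mu)^T (x, mu) for x of shape (d, C) and mu of shape (d, k)."""
    stacked = np.concatenate([x, mu], axis=1)   # shape (d, C + k)
    return stacked.T @ stacked                  # shape (C + k, C + k)

rng = np.random.default_rng(0)
d, C, k = 500, 3, 3                             # illustrative sizes (assumption)

# Orthonormal class means and a small random parameter matrix (assumptions).
mu = np.linalg.qr(rng.standard_normal((d, k)))[0]
x = rng.standard_normal((d, C)) / np.sqrt(d)

G = gram_summary(x, mu)

# G only records inner products, so rotating x and mu together leaves it unchanged.
Q = np.linalg.qr(rng.standard_normal((d, d)))[0]
G_rotated = gram_summary(Q @ x, Q @ mu)
print(G.shape, np.allclose(G, G_rotated))
```

The size of `G` is fixed by `C + k` and does not grow with `d`, which is the sense in which the summary is low-dimensional.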

Abstract

We study the local geometry of empirical risks in high dimensions via the spectral theory of their Hessian and information matrices. We focus on settings where the data, $(Y_\ell)_{\ell =1}^n \in \mathbb{R}^d$, are i.i.d. draws of a $k$-Gaussian mixture model, and the loss depends on the projection of the data into a fixed number of vectors, namely $\mathbf{x}^\top Y$, where $\mathbf{x}\in \mathbb{R}^{d\times C}$ are the parameters, and $C$ need not equal $k$. This setting captures a broad class of problems such as classification by one and two-layer networks and regression on multi-index models. We provide exact formulas for the limits of the empirical spectral distribution and outlier eigenvalues and eigenvectors of such matrices in the proportional asymptotics limit, where the number of samples and dimension $n,d\to\infty$ and $n/d=\phi\in (0,\infty)$. These limits depend on the parameters $\mathbf{x}$ only through the summary statistic of the $(C+k)\times (C+k)$ Gram matrix of the parameters and class means, $\mathbf{G} = (\mathbf{x},\boldsymbol{\mu})^\top(\mathbf{x},\boldsymbol{\mu})$. It is known that under general conditions, when $\mathbf{x}$ is trained by online stochastic gradient descent, the evolution of these same summary statistics along training converges to the solution of an autonomous system of ODEs, called the effective dynamics. This enables us to connect the training dynamics to the spectral theory of these matrices generated with test data. We demonstrate our general results by analyzing the effective spectrum along the effective dynamics in the case of multi-class logistic regression. In this setting, the empirical Hessian and information matrices have substantially different spectra, each with their own static and even dynamical spectral transitions.
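
To make the link between online SGD and the summary statistics concrete, here is a rough sketch that runs one-pass SGD for multi-class logistic regression on a Gaussian mixture and records $\mathbf{G}$ along the way. Several choices are assumptions made for illustration and are not confirmed by the text above: data of the form $Y=\mu_a+Z$ with $Z\sim\mathcal{N}(0,I_d/\lambda)$ and a uniformly random class $a$, orthonormal means, $C=k$, zero initialization, a constant step size of $1/d$, and the function name `online_sgd_summaries`. It simulates the finite-$d$ trajectory only and does not solve the limiting ODEs.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                               # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def online_sgd_summaries(mu, lam, n_steps, step, rng, record_every=100):
    """One-pass SGD on softmax cross-entropy; returns a list of recorded G matrices."""
    d, k = mu.shape
    x = np.zeros((d, k))                          # C = k classifiers, started at zero (assumption)
    history = []

    def record():
        s = np.concatenate([x, mu], axis=1)       # shape (d, 2k)
        history.append(s.T @ s)                   # G = (x, mu)^T (x, mu)

    for t in range(n_steps):
        if t % record_every == 0:
            record()
        a = rng.integers(k)                       # fresh sample each step (online setting)
        Y = mu[:, a] + rng.standard_normal(d) / np.sqrt(lam)
        g = softmax(x.T @ Y)
        g[a] -= 1.0                               # gradient of the loss w.r.t. the logits
        x -= step * np.outer(Y, g)                # gradient step on all k classifiers
    record()                                      # summary at the final iterate
    return history

rng = np.random.default_rng(0)
d, k, lam, phi = 200, 3, 3.0, 4.0                 # small d for speed (assumption)
mu = np.linalg.qr(rng.standard_normal((d, k)))[0]
n = int(phi * d)
history = online_sgd_summaries(mu, lam, n, 1.0 / d, rng, record_every=n // 4)

# Alignment of each classifier with its own class mean, read off from G.
for G in history:
    print(np.round(np.diag(G[:k, k:]), 3))
```

Each recorded `G` has size `2k x 2k` regardless of `d`, so the whole trajectory can be stored and compared across dimensions.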

Paper Structure

This paper contains 39 sections, 38 theorems, 231 equations, 7 figures.

Key Result

Theorem 1.3

Fix $M\in\{H,G\}$, $\alpha\in[\mathcal{C}]$, and $\lambda,\phi>0$. Let $n,d$ get large with $n/d=\phi$. For any $\mathbf x\in \mathbb R^{\mathcal{C} d}$ with summary statistics matrix ${\mathbf G} \in \mathbb R^{q\times q}$, we have that with probability $1-o(1)$, Furthermore, given a sequence of parameters $\mathbf{x}^{(d)}$ with summary statistic matrices ${\mathbf G}^{(d)}\to {\mathbf G}$ as $

Figures (7)

  • Figure 1.1: Histogram of the empirical Hessian spectrum (orange) at various points in parameter space, with $d=20,000$, $k=3$, $\lambda =3$ and $\phi=4$. The parameter values plotted are, from left to right, at $\mathbf{x} \equiv 0$, at $\mathbf{x} \equiv \mu+ \mathcal{N}(0,I_d/d)$, and at the optimal classifier $\mathbf {x}^\alpha = (1-\frac{1}{k})\mu^\alpha - \sum_{j\ne \alpha}\frac{1}{k} \mu_j$. Arrows point to the empirical locations of the outlier eigenvalues for visibility, and the blue curve is the theoretically predicted bulk spectrum at those summary statistic values.
  • Figure 1.2: From left to right: the histogram of the empirical Gradient matrix (orange) at $\lambda = 1,2,20$ respectively, with $d= 20,000$, $k=3$ and $\phi =4$ at initialization, with arrows pointing at the empirical locations of the outlier eigenvalues. This demonstrates the existence of distinct transition SNRs, $\lambda_{c,1}<\lambda_{c,2}$, for the existence of one, then $k$ outlier eigenvalues, as predicted by Corollary \ref{cor:exact-distribution-at-initialization}.
  • Figure 1.3: From left to right: spectra of the empirical Hessian computed with test data (orange histogram) overlaid with the predicted bulk spectrum given its summary statistic values (blue curve) after $0$, $n/5$ and $3n/5$ steps of online SGD, with $d=20,000$, $k=3$ and $\phi=4$. The figures demonstrate the splitting of the multiplicity-$k$ outlier eigenvalue into a more pronounced outlier, corresponding to the mean being learned by the classifier being trained, and the remaining $k-1$, as predicted by Theorem \ref{thm:effective-bulk-along-effective-dynamics}.
  • Figure 1.4: The analogue of Figure 1.3 with train data used to generate the empirical Hessian matrix rather than test data. The similarity to Figure 1.3 shows a good match regardless of which data set is used to generate the matrices.
  • Figure 1.5: The evolution of the empirical (first-layer) Hessian spectrum (orange histogram) for the XOR GMM task, together with the predicted bulk spectrum (blue curve) and outlier locations (arrows) after $0$, $n/3$ and $2n/3$ steps of online SGD. $d=20,000$, $k=3$, and $\lambda = \phi =4$.
  • ...and 2 more figures
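
A small-scale version of the kind of experiment described in the Figure 1.1 caption can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the data are taken to be $Y=\mu_a+Z$ with $Z\sim\mathcal{N}(0,I_d/\lambda)$ and uniformly random class $a$, the means are orthonormal, the loss is unregularized softmax cross-entropy, and the names `sample_gmm`, `hessian_block` and `mp_upper_edge_at_zero` are hypothetical. At $\mathbf{x}\equiv 0$ every softmax probability equals $1/k$, so the $\alpha$-th Hessian block is a constant multiple of the sample second-moment matrix and its bulk is a rescaled Marchenko-Pastur law; the sketch only uses that standard fact and does not implement the paper's general bulk formula.

```python
import numpy as np

def sample_gmm(n, mu, lam, rng):
    """Draw n rows Y = mu[:, a] + Z with a uniform over classes, Z ~ N(0, I_d / lam)."""
    d, k = mu.shape
    labels = rng.integers(k, size=n)
    noise = rng.standard_normal((n, d)) / np.sqrt(lam)
    return mu[:, labels].T + noise, labels

def hessian_block(x, Y, alpha):
    """alpha-th diagonal (d x d) block of the empirical softmax cross-entropy Hessian."""
    logits = Y @ x                                   # shape (n, C)
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    p = np.exp(logits)
    p /= p.sum(axis=1, keepdims=True)
    w = p[:, alpha] * (1.0 - p[:, alpha])            # per-sample curvature weight
    return (Y * w[:, None]).T @ Y / Y.shape[0]

def mp_upper_edge_at_zero(k, lam, phi):
    """Right edge of the Marchenko-Pastur bulk of the Hessian block at x = 0."""
    weight = (1.0 / k) * (1.0 - 1.0 / k)             # p(1 - p) with p = 1/k
    return weight * (1.0 + 1.0 / np.sqrt(phi)) ** 2 / lam

rng = np.random.default_rng(0)
d, k, lam, phi = 200, 3, 3.0, 4.0                    # d is much smaller than in the figure
mu = np.linalg.qr(rng.standard_normal((d, k)))[0]    # orthonormal means (assumption)
Y, _ = sample_gmm(int(phi * d), mu, lam, rng)

H = hessian_block(np.zeros((d, k)), Y, alpha=0)
eigs = np.linalg.eigvalsh(H)
edge = mp_upper_edge_at_zero(k, lam, phi)
print("largest eigenvalues:", np.round(eigs[-5:], 4))
print("predicted bulk edge at x = 0:", round(edge, 4))
```

At this small `d` the separation between bulk edge and outliers is blurred by finite-size fluctuations; increasing `d` or `lam` makes the outliers easier to see in a histogram of `eigs`.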

Theorems & Definitions (79)

  • Definition 1.1
  • Definition 1.2
  • Theorem 1.3: Bulk
  • Theorem 1.4: Outliers
  • Corollary 1.5
  • Corollary 1.6
  • Theorem 1.7
  • Corollary 1.8
  • Remark 1.9
  • Theorem 1.10: Bulk
  • ...and 69 more