
Characterization of Gaussian Universality Breakdown in High-Dimensional Empirical Risk Minimization

Chiheb Yaakoubi, Cosme Louart, Malik Tiomoko, Zhenyu Liao

Abstract

We study high-dimensional convex empirical risk minimization (ERM) under general non-Gaussian data designs. By heuristically extending the Convex Gaussian Min-Max Theorem (CGMT) to non-Gaussian settings, we derive an asymptotic min-max characterization of key statistics, enabling approximation of the mean $\mu_{\hat\theta}$ and covariance $C_{\hat\theta}$ of the ERM estimator $\hat\theta$. Specifically, under a concentration assumption on the data matrix and standard regularity conditions on the loss and regularizer, we show that for a test covariate $x$ independent of the training data, the projection $\hat\theta^\top x$ approximately follows the convolution of the (generally non-Gaussian) distribution of $\mu_{\hat\theta}^\top x$ with an independent centered Gaussian variable of variance $\text{Tr}(C_{\hat\theta}\mathbb{E}[xx^\top])$. This result clarifies the scope and limits of Gaussian universality for ERMs. Additionally, we prove that any $\mathcal{C}^2$ regularizer is asymptotically equivalent to a quadratic form determined solely by its Hessian at zero and its gradient at $\mu_{\hat\theta}$. Numerical simulations across diverse losses and models are provided to validate our theoretical predictions and qualitative insights.
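To make the convolution characterization above concrete, here is a minimal Python sketch that samples decision scores as $\mu_{\hat\theta}^\top x$ plus an independent centered Gaussian of variance $\text{Tr}(C_{\hat\theta}\mathbb{E}[xx^\top])$. The helper score_samples, the plug-in estimate of $\mathbb{E}[xx^\top]$, and the values of mu_hat, C_hat, and the bimodal covariate design are illustrative assumptions, not quantities from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def score_samples(mu_hat, C_hat, X_test, rng):
    """Sample decision scores via the convolution characterization:
    theta_hat^T x  ~  mu_hat^T x + N(0, Tr(C_hat E[x x^T])), Gaussian term independent of x."""
    n_test = X_test.shape[0]
    mean_part = X_test @ mu_hat            # generally non-Gaussian part
    Sigma_x = X_test.T @ X_test / n_test   # plug-in estimate of E[x x^T]
    var = np.trace(C_hat @ Sigma_x)        # variance of the independent Gaussian part
    return mean_part + rng.normal(0.0, np.sqrt(var), size=n_test)

# Toy illustration: a bimodal covariate coordinate makes the score itself bimodal.
p, n_test = 50, 10_000
X = rng.standard_normal((n_test, p))
X[:, 0] = rng.choice([-3.0, 3.0], size=n_test) + 0.3 * rng.standard_normal(n_test)
mu_hat = np.eye(p)[0]                      # hypothetical mean of the ERM estimator
C_hat = np.eye(p) / p                      # hypothetical covariance of the ERM estimator
scores = score_samples(mu_hat, C_hat, X, rng)
```

If every coordinate of $x$ were Gaussian, the resulting scores would be Gaussian as well; the bimodal first coordinate above is what breaks that universality in this toy setting.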

Paper Structure

This paper contains 26 sections, 20 theorems, 170 equations, 4 figures, 1 table.

Key Result

Theorem 2.1

Under Assumptions ass:design and ass:regularity, there exist constants $C,c>0$, independent of $n,p$, such that for any $1$-Lipschitz $f:\mathbb{R}^p\to\mathbb{R}$, the observation $f(\hat{\theta})$ concentrates exponentially fast around its expectation, with the constants $C$ and $c$ controlling the tail.
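A minimal sketch of what a concentration bound of this type typically looks like, assuming a standard sub-Gaussian tail (the exact exponent and its dimensional scaling are not taken from the theorem):

\[
\mathbb{P}\Big( \big| f(\hat{\theta}) - \mathbb{E}\big[f(\hat{\theta})\big] \big| \geq t \Big) \;\leq\; C\, e^{-c\, t^{2}} \qquad \text{for all } t > 0 .
\]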

Figures (4)

  • Figure 1: Non-Gaussian Decision Scores and Classification Error. Left: Empirical histograms of decision scores for Class 0 (light blue) and Class 1 (light red) exhibit non-Gaussian distributions that align closely with the theoretical predictions (dashed blue). Gaussian approximations (green dashed) fail to capture the skewness and bimodality per class. Right: Classification error as a function of the regularization parameter. The empirical error (red) closely matches the theoretical non-Gaussian model (solid blue), while the Gaussian assumption on the score (green dashed) overestimates the error and the Gaussian assumption on the data (purple dashed) underestimates it.
  • Figure 2: We examine different score distributions for various regularization functions $\rho: \theta \mapsto a^\top \theta + \|\theta\|^2$, where $a = (-\cos(\phi), \sin(\phi), 0,\ldots, 0)$ for angles $\phi \in \{0, \frac{\pi}{2}, \pi\}$ (from bottom left to bottom right). We use the squared loss $\mathcal{L}_y(z) = (z-y)^2$, with $y = \theta^*{}^\top x + \varepsilon$, where $\theta^* = e_1$. For all $i \in [p] \setminus \{2\}$, $x^\top e_i \sim \mathcal{N}(0,1)$, while $x^\top e_2$ follows a bimodal distribution. According to Corollary \ref{cor:gaussian_projection}, the score is Gaussian when $x^\top a$ is Gaussian, which occurs here only at the extremal points of the graph ($\phi\in\{0,\pi\}$). The error (top left) is minimized when $a = -e_1$ (i.e. $\phi = 0$), in which case $\theta^*$ minimizes $\rho$. A minimal simulation of this setup is sketched after the figure list.
  • Figure 3: We write $F_a := F + \mathrm{span}(a)$ and $F_a^\perp := (F + \mathrm{span}(a))^\perp$, and denote by $P_E$ the orthogonal projection onto a subspace $E$. We define $\mathcal{J}_\mu(\mu) := \mathbb{E}\!\left[ e_{\mathcal{L}_y}\!\left(\mu^\top x + \alpha z;\kappa\right) \right] + \rho(\mu)$, the part of $\mathcal{J}$ minimized by $\mu_*$. (left) Comparison of the values of $t \mapsto \mathcal{J}_\mu(\mu_* + t u)$ for a random direction $u = V$ (solid line), $u = V_{F_a} = P_{F_a} V$ (dashed line), $u = V_F = P_F V$ (dash--dot line), and the null direction $u = 0$ (bold solid line). (right) From left to right and from top to bottom, we display the distributions of $x^\top \mu_\ast$, $(P_F x)^\top \mu_\ast$, $(P_{F_a} x)^\top \mu_\ast$, and $(P_{F_a^\perp} x)^\top \mu_\ast$.
  • Figure 4: Universality breakdown on MNIST data. Left: Empirical histograms of the decision scores $\hat{\theta}^\top x$ for Class 0 (light blue) and Class 1 (light red), compared with a Gaussian approximation of matching mean and variance (green dashed) and with the corrected theoretical density (dashed blue). Right: Generalization performance. Predictions based on Gaussian score universality (green dashed) fail to match the empirical results, while the corrected theoretical predictions are accurate. Performance obtained by replacing the data with a moment-matched Gaussian surrogate (purple dashed) also exhibits a substantial mismatch.
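As referenced in the Figure 2 caption, the following is a minimal, self-contained simulation of that setup: squared loss, regularizer $\rho(\theta) = a^\top\theta + \|\theta\|^2$, $\theta^* = e_1$, and a bimodal second covariate coordinate. The sample sizes, noise level, and bimodal mixture used below are illustrative assumptions, and the closed-form solver erm_squared_loss is a helper written for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def erm_squared_loss(X, y, a):
    """Closed-form minimizer of (1/n) * sum_i (theta^T x_i - y_i)^2 + a^T theta + ||theta||^2."""
    n, p = X.shape
    return np.linalg.solve(X.T @ X / n + np.eye(p), X.T @ y / n - a / 2.0)

def bimodal_design(n, p, rng):
    """Covariates with standard Gaussian coordinates except a bimodal second coordinate."""
    X = rng.standard_normal((n, p))
    X[:, 1] = rng.choice([-2.0, 2.0], size=n) + 0.3 * rng.standard_normal(n)
    return X

n, p = 2000, 200
theta_star = np.eye(p)[0]                          # ground truth theta* = e_1
X = bimodal_design(n, p, rng)
y = X @ theta_star + 0.5 * rng.standard_normal(n)  # y = theta*^T x + noise

for phi in (0.0, np.pi / 2, np.pi):                # angles used in the figure
    a = np.zeros(p)
    a[0], a[1] = -np.cos(phi), np.sin(phi)
    theta_hat = erm_squared_loss(X, y, a)
    scores = bimodal_design(10_000, p, rng) @ theta_hat
    # scores are (close to) Gaussian only when x^T a is Gaussian, i.e. phi in {0, pi}
```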

Theorems & Definitions (37)

  • Theorem 2.1: Concentration of $\hat{\theta}$
  • Corollary 2.2
  • Proposition 3.1
  • Lemma 3.2
  • Theorem 3.3: Quadratic universality of regularization
  • Claim 4.1: Generalized CGMT for concentrated column designs
  • Remark 4.2
  • Theorem 4.3: Min--max formulation of limiting asymptotics
  • Theorem 4.4: Fixed-point formulation
  • Corollary 5.1: Generalization error of Ridge regression
  • ...and 27 more
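
A rough sketch of the quadratic surrogate appearing in Theorem 3.3 (quadratic universality of regularization), assuming a second-order-expansion form; only the dependence on the Hessian at zero and the gradient at $\mu_{\hat\theta}$ is taken from the abstract, while the centering and constants below are illustrative:

\[
\rho(\theta) \;\approx\; \mathrm{const} \;+\; \nabla\rho(\mu_{\hat\theta})^\top \big(\theta - \mu_{\hat\theta}\big) \;+\; \tfrac{1}{2}\,\big(\theta - \mu_{\hat\theta}\big)^\top \nabla^{2}\rho(0)\,\big(\theta - \mu_{\hat\theta}\big).
\]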