When does Gaussian equivalence fail and how to fix it: Non-universal behavior of random features with quadratic scaling

Garrett G. Wen, Hong Hu, Yue M. Lu, Zhou Fan, Theodor Misiakiewicz

TL;DR

This work identifies a non-universal breakdown of Gaussian equivalence for random features in the quadratic scaling regime, where the target depends on a low-dimensional projection of the data. It introduces the Conditional Gaussian Equivalent (CGE) model, which augments the Gaussian surrogate with a low-dimensional non-Gaussian component to capture the chaos components that GET misses. The authors prove sharp asymptotics for training and test errors under CGE using a two-phase Lindeberg swapping strategy and Malliavin-Stein-based CLTs, with an intermediary Partial Gaussian Equivalent (PGE) model bridging the gap. They further demonstrate that CGE accurately predicts phenomena such as generalized linear model behavior, phase transitions, interpolation thresholds, double descent, and benign overfitting for RF models in the quadratic scaling regime, offering a framework beyond GET for universality in high-dimensional ERM.

Abstract

A major effort in modern high-dimensional statistics has been devoted to the analysis of linear predictors trained on nonlinear feature embeddings via empirical risk minimization (ERM). Gaussian equivalence theory (GET) has emerged as a powerful universality principle in this context: it states that the behavior of high-dimensional, complex features can be captured by Gaussian surrogates, which are more amenable to analysis. Despite its remarkable successes, numerical experiments show that this equivalence can fail even for simple embeddings -- such as polynomial maps -- under general scaling regimes. We investigate this breakdown in the setting of random feature (RF) models in the quadratic scaling regime, where both the number of features and the sample size grow quadratically with the data dimension. We show that when the target function depends on a low-dimensional projection of the data, such as generalized linear models, GET yields incorrect predictions. To capture the correct asymptotics, we introduce a Conditional Gaussian Equivalent (CGE) model, which can be viewed as appending a low-dimensional non-Gaussian component to an otherwise high-dimensional Gaussian model. This hybrid model retains the tractability of the Gaussian framework and accurately describes RF models in the quadratic scaling regime. We derive sharp asymptotics for the training and test errors in this setting, which continue to agree with numerical simulations even when GET fails. Our analysis combines general results on CLT for Wiener chaos expansions and a careful two-phase Lindeberg swapping argument. Beyond RF models and quadratic scaling, our work hints at a rich landscape of universality phenomena in high-dimensional ERM.
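For orientation, the setting described in the abstract can be written out as a short sketch. The symbols $\phi_{\sf RF}$, $\hat{\boldsymbol\theta}$, $\sigma$, $\lambda$, $n$, $p$, and $d$ appear in the figure captions below; the weight notation $\boldsymbol{w}_j$, the $1/n$ normalization of the empirical risk, and the exact form of the ridge penalty are assumptions made here for illustration and may differ from the paper's conventions.

```latex
% Random feature map with activation \sigma and (assumed) random weights w_1, ..., w_p
\[
  \phi_{\sf RF}(\boldsymbol{x})
    = \big( \sigma(\langle \boldsymbol{w}_j, \boldsymbol{x} \rangle) \big)_{j \in [p]}
    \in \mathbb{R}^p .
\]
% Regularized ERM over linear predictors on the features (normalization assumed)
\[
  \hat{\boldsymbol{\theta}}
    = \operatorname*{arg\,min}_{\boldsymbol{\theta} \in \mathbb{R}^p}
      \Big\{ \frac{1}{n} \sum_{i=1}^{n}
        \ell\big( y_i, \langle \boldsymbol{\theta}, \phi_{\sf RF}(\boldsymbol{x}_i) \rangle \big)
        + \lambda \| \boldsymbol{\theta} \|_2^2 \Big\} .
\]
% Quadratic scaling regime: sample size and number of features both grow like d^2
\[
  n \asymp d^2 , \qquad p \asymp d^2 .
\]
```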

Paper Structure

This paper contains 73 sections, 50 theorems, 639 equations, 5 figures.

Key Result

Theorem 3.9

Suppose Assumptions assumption:scaling--assumption:activation hold. Then, there exist constants $c,d_0>0$ depending only on the constants in these assumptions, such that for all $d \geq d_0$ and any twice-differentiable function $\varphi : \mathbb{R} \to \mathbb{R}$ with $\| \varphi \|_\infty, \| \ In particular, for every $\varepsilon \in (0,1)$ and $\kappa \in \mathbb{R}$, Consequently, for al

Figures (5)

  • Figure 1: Universality and non-universality of the test error, with each quadrant representing a combination of either $\ell_2$ loss (quadratic) or hinge loss (non-quadratic) for the training loss and either $\ell_2$ loss (quadratic) or 0-1 loss (non-quadratic) for the test loss. We choose $d=50$, $n/d^2 = 0.5$, $\lambda = 10^{-3}$ and $\sigma(x) = \max\{0,x\}$, and consider two different responses: (i) a "single-index" response: $y_{\rm SI} = 1 + 2 {\rm He}_2(\langle {\boldsymbol u}_{*}, {\boldsymbol x}\rangle) + {\rm He}_3(\langle{\boldsymbol u}_{*}, {\boldsymbol x}\rangle)$; and (ii) a "random-poly" response: $y_{\rm R} = 1 + 2 {\boldsymbol \beta}_{ 2}^{\mathsf T} {\boldsymbol h}_{2} ({\boldsymbol x}) + {\boldsymbol \beta}_{ 3}^{\mathsf T} {\boldsymbol h}_{3} ({\boldsymbol x})$, where ${\boldsymbol \beta}_{ k} \sim {\rm Unif} \left( \mathbb{S}^{B_{d,k} - 1} \right)$. Both responses result in the same GE model predictions.
  • Figure 2: Marginal distributions of $\langle\hat{{\boldsymbol \theta}},\phi_{{\sf RF}} ({\boldsymbol x})\rangle$ for two different responses $y_{\rm SI}$ and $y_{\rm R}$, where $\hat{{\boldsymbol \theta}}$ is the solution of \ref{eq:intro_ERM} with squared loss. The histogram corresponds to the empirical distribution; the curves correspond to the predictions from the CGE model (solid red line) and the GE model (black dashed line). We choose $d=50$, $n/d^2 = 1$, $p/d^2 = 0.5$, $\lambda = 10^{-3}$, and $\sigma(x) = x^2$.
  • Figure 3: (a) and (b): 2D diagram of $\mathbb{P}(y_{{\sf RF}}({\boldsymbol x}) = +1)$, where $y_{{\sf RF}}({\boldsymbol x}) = \operatorname{sign}(\phi_{{\sf RF}} ({\boldsymbol x}))$ and ${\boldsymbol x} \in \mathbb{R}^d$ is a new test sample, with $x_1$ and $x_2$ given and $(x_3, x_4, \cdots, x_d) \sim \mathcal{N}(0,{\mathbf I}_{d-2})$. We choose $d=30$, $\lambda = 10^{-3}$, and $y = \operatorname{sign}\left({\rm He}_2( \frac{x_1 + x_2}{\sqrt{2}} )\right)$. We fix $n/d^2 = 2.5$. The red dashed lines correspond to the ground-truth boundaries between the two classes $y=+1$ and $y=-1$. (c): Theoretical predictions and empirical results of $\mathbb{P}(y_{{\sf RF}}({\boldsymbol x}) = +1)$ along the line $x_2 = x_1$ in the 2D diagram. The black solid line corresponds to the ground-truth label.
  • Figure 4: Phase diagram for the existence of an interpolating RF model in binary classification in the quadratic scaling regime. Each pixel value represents the empirical probability that the RF model interpolates the training data $(y_i,{\boldsymbol x}_i)_{i\in[n]}$, averaged over 60 independent trials. The red solid lines represent the asymptotic predictions for the phase transition boundary from the CGE model, and the green dashed curves are the asymptotic predictions from the GE model. We fix $d=50$ and $n/d^2 = 0.45$. The target function is $f_*({\boldsymbol x}) = \sum_{k=0}^{3}\mu_{*,k} {\rm He}_k({\boldsymbol u}_{*}^{\mathsf T} {\boldsymbol x})$ and the label is generated by the logistic model: $\mathbb{P}(y = 1 \mid {\boldsymbol x}) = (1 + e^{-s_* f_*({\boldsymbol x})})^{-1}$. Left panel: $\mu_{*,0}=2$, $\mu_{*,1}=1$, $\mu_{*,2}=2$ and $\mu_{*,3}=0.6$; right panel: $s_* = 4$, $\mu_{*,1}=1$, $\mu_{*,2}=2$ and $\mu_{*,3}=0.6$.
  • Figure 5: Benign overfitting and double descent phenomenon in binary classification. We choose $d=40$, $\lambda = 10^{-4}$, and the logistic function $\ell(y,z) = \log(1+e^{-yz})$ for both training and test loss. Left plot: Fix $n/d^2 = 0.75$. Right plot: Fix $p/d^2 = 1$.
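The experimental setup in the Figure 1 caption (ReLU random features, ridge-regularized squared loss, single-index response $y_{\rm SI}$) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' code: the unnormalized probabilists' Hermite polynomials, the random weights drawn uniformly on the unit sphere, the $1/n$ normalization of the loss, and the default `p_ratio` are all choices made here, and the function names are hypothetical. The default dimension is kept small so the sketch runs quickly; the caption uses $d=50$.

```python
import numpy as np


def he2(z):
    # Probabilists' Hermite polynomial He_2(z) = z^2 - 1 (unnormalized; an assumption).
    return z**2 - 1.0


def he3(z):
    # Probabilists' Hermite polynomial He_3(z) = z^3 - 3z (unnormalized; an assumption).
    return z**3 - 3.0 * z


def single_index_response(X, u):
    # y_SI = 1 + 2 He_2(<u, x>) + He_3(<u, x>), as in the Figure 1 caption.
    z = X @ u
    return 1.0 + 2.0 * he2(z) + he3(z)


def relu_features(X, W):
    # Random feature map sigma(<w_j, x>) with sigma(x) = max{0, x}.
    return np.maximum(X @ W.T, 0.0)


def rf_ridge_test_error(d=12, n_ratio=0.5, p_ratio=0.5, lam=1e-3,
                        n_test=2000, seed=0):
    """Squared test error of RF ridge regression with n = n_ratio * d^2 samples
    and p = p_ratio * d^2 features (quadratic scaling)."""
    rng = np.random.default_rng(seed)
    n = int(n_ratio * d**2)
    p = int(p_ratio * d**2)

    # Hidden direction u_* on the unit sphere.
    u = rng.standard_normal(d)
    u /= np.linalg.norm(u)

    # Random weights on the unit sphere (assumed distribution).
    W = rng.standard_normal((p, d))
    W /= np.linalg.norm(W, axis=1, keepdims=True)

    # Isotropic Gaussian covariates for training and testing.
    X = rng.standard_normal((n, d))
    X_test = rng.standard_normal((n_test, d))
    y = single_index_response(X, u)
    y_test = single_index_response(X_test, u)

    Phi = relu_features(X, W)
    Phi_test = relu_features(X_test, W)

    # Ridge solution of (1/n) ||y - Phi theta||^2 + lam ||theta||^2.
    theta = np.linalg.solve(Phi.T @ Phi / n + lam * np.eye(p), Phi.T @ y / n)
    return float(np.mean((y_test - Phi_test @ theta) ** 2))


if __name__ == "__main__":
    print(rf_ridge_test_error())
```

Comparing this empirical error against the GE and CGE predictions would additionally require the asymptotic formulas from the paper, which are not reproduced here.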

Theorems & Definitions (105)

  • Remark 2.1: Beyond ridge penalty
  • Remark 2.2: Fourth Moment Theorem
  • Remark 2.3: General polynomial scaling
  • Remark 3.1
  • Remark 3.2
  • Theorem 3.9: Universality of Training Error
  • Remark 3.3
  • Theorem 3.13: Universality of Test Error
  • Theorem 4.1: Quantitative CLT on Wiener Chaos (nualart2005central, nourdin2009stein)
  • Theorem 4.2: Replacing higher-order chaos with isotropic Gaussian
  • ...and 95 more