The Breakdown of Gaussian Universality in Classification of High-dimensional Linear Factor Mixtures

Xiaoyi Mai, Zhenyu Liao

TL;DR

The paper analyzes high-dimensional ridge-regularized empirical risk minimization (ERM) for binary classification under linear factor mixture models (LFMM) and shows that Gaussian universality can break down: the asymptotic learning performance then depends on the data distribution beyond the class means and covariances. Using a leave-one-out analysis, it derives a self-consistent system of equations for $\theta$, $\eta$, $\gamma$, and $\omega_k$ that fully characterizes the asymptotic training and test scores, with $\hat{\boldsymbol{\beta}} \simeq (\lambda I_p+\theta \Sigma)^{-1}(\eta \mu+\sum_k \omega_k v_k+\gamma \Sigma^{1/2} u)$ and $r = y m + \sigma \tilde e + \sum_k \psi_k e_k$. The authors identify exact conditions under which Gaussian universality holds, namely Gaussian informative factors or linearity of the loss derivative, and show that non-Gaussian factors induce non-universal behavior that affects both loss design and classifier performance. These results quantify when Gaussian-based approximations remain valid in high-dimensional classification and highlight the potential to tailor the loss to the data distribution and sample size for improved performance in LFMM settings.

Abstract

The assumption of Gaussian or Gaussian mixture data has been extensively exploited in a long series of precise performance analyses of machine learning (ML) methods, on large datasets having comparably numerous samples and features. To relax this restrictive assumption, subsequent efforts have been devoted to establish "Gaussian equivalent principles" by studying scenarios of Gaussian universality where the asymptotic performance of ML methods on non-Gaussian data remains unchanged when replaced with Gaussian data having the same mean and covariance. Beyond the realm of Gaussian universality, there are few exact results on how the data distribution affects the learning performance. In this article, we provide a precise high-dimensional characterization of empirical risk minimization, for classification under a general mixture data setting of linear factor models that extends Gaussian mixtures. The Gaussian universality is shown to break down under this setting, in the sense that the asymptotic learning performance depends on the data distribution beyond the class means and covariances. To clarify the limitations of Gaussian universality in the classification of mixture data and to understand the impact of its breakdown, we specify conditions for Gaussian universality and discuss their implications for the choice of loss function.

Paper Structure (24 sections, 5 theorems, 126 equations, 10 figures)

This paper contains 24 sections, 5 theorems, 126 equations, 10 figures.

Key Result

Theorem 1

Let Assumptions ass:loss, ass:LFMM, and ass:growth-rate hold, for $\hat{ \boldsymbol{\beta} }$ solution to the ERM problem in eq:opt-origin-reg on a training set $\{ ({\mathbf{x}}_i, y_i )\}_{i=1}^n$ of size $n$ drawn i.i.d. $({\mathbf{x}}_i, y_i )\sim\mathcal{D}_{({\mathbf{x}},y)}$ from the LFMM in for any deterministic feature vector $\boldsymbol{\nu} \in\mathbb{R}^p$, and where for Gaussian v

Figures (10)

  • Figure 1: Theoretical and empirical distribution of predicted scores $\hat{ \boldsymbol{\beta} }^ {\sf T} {\mathbf{x}}'$ for some fresh $({\mathbf{x}}',y')\sim\mathcal{D}_{({\mathbf{x}},y)}$ independent of $\hat{ \boldsymbol{\beta} }$. The theoretical probability densities (red) are obtained from Theorem 1, and the empirical histograms (blue) are the values of $\hat{ \boldsymbol{\beta} }^ {\sf T} {\mathbf{x}}'$ over $10^6$ independent copies of ${\mathbf{x}}'$, for three different LFMMs as in Definition 1 with $n=600$, $p=200$, $\rho=0.5$, $s=[\sqrt{2};\mathbf{0}_{p-1}]$ (so that $q = 1$), and Haar distributed ${\mathbf{V}}$. Left: normal $e_1$ and uniformly distributed $e_2,\ldots,e_p$; Middle: normal $e_1,\ldots,e_p$; Right: uniformly distributed $e_1$ and normal $e_2,\ldots,e_p$.
  • Figure 2: Empirical and theoretical results under an LFMM with $p=200$, $\rho=0.5$, $s=[\sqrt{2};\mathbf{0}_{p-1}]$, Rademacher $e_1$, normal $e_2,\ldots,e_p$, and Haar distributed ${\mathbf{V}}$. Top: scatter plot of $200$ independent pairs $[r,h_\kappa(r,\pm1)]$. Bottom: histograms of predicted scores on $10^6$ fresh samples $({\mathbf{x}}',y')\sim\mathcal{D}_{({\mathbf{x}},y)}$ given by $\hat{ \boldsymbol{\beta} }$ and $\hat{ \boldsymbol{\beta} }^{\mathbf{g}}$, versus the theoretical densities obtained from Theorem 1. Left: $n=100$, square loss $\ell(\hat{y},y)=(\hat{y}-y)^2/2$. Right: $n=600$, squared hinge loss $\ell(\hat{y},y)=\max\{0,(1-\hat{y} y)\}^2$.
  • Figure 3: Empirical classification accuracy of $\hat{ \boldsymbol{\beta} }_{\ell,\lambda}$, computed over $10^5$ independent copies of $({\mathbf{x}}',y')\sim\mathcal{D}_{({\mathbf{x}},y)}$ and averaged over $100$ trials (shown with a band of $\pm 1$ standard deviation), versus the theoretical performance given by Theorem 1, for the square loss $\ell(\hat{y}, y) = (y-\hat{y})^2/2$ and the logistic loss $\ell(\hat{y}, y) = -\ln(1/(1+e^{-y\hat{y}}))$, on $n=800$ training samples. Left: GMM under Definition 1 with $p=200$, $\rho=0.5$, $s=[1.5;0.5;{\mathbf{0}}_{p-2}]$ (so that $q = 2$), and ${\mathbf{V}}={\rm diag}(2,{\mathbf{1}}_{p-1}){\mathbf{H}}$ with Haar distributed ${\mathbf{H}}$. Right: LFMM identical to the GMM on the left, but with Rademacher $e_1$.
  • Figure 4: Histograms of the first and second informative factors of Classes $4$ and $5$, estimated using all samples from the Fashion-MNIST dataset.
  • Figure 5: Histograms of the first and second informative factors of Classes $3$ and $7$, estimated using all samples from the Fashion-MNIST dataset.
  • ...and 5 more figures
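
The simulation setting of Figure 1 ($n=600$, $p=200$, $\rho=0.5$, $s=[\sqrt{2};\mathbf{0}_{p-1}]$, Haar distributed ${\mathbf{V}}$) can be sketched roughly as follows. This is a minimal illustration, not the authors' code. Definition 1 is not reproduced on this page, so the sampling form ${\mathbf{x}} = {\mathbf{V}}(y s + e)$ with independent zero-mean, unit-variance factors $e_k$ is an assumption, as are the square loss, the value $\lambda = 1$, the reduced test-set size, and every function name.

```python
import numpy as np

def haar_orthogonal(p, rng):
    """Draw a p x p Haar-distributed orthogonal matrix via QR of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.standard_normal((p, p)))
    return q * np.sign(np.diag(r))  # column sign fix makes the law Haar

def sample_lfmm(n, V, s, rho, rng, first_factor="gaussian"):
    """Sample (X, y) from the ASSUMED form x = V (y s + e), y = +1 w.p. rho, else -1."""
    p = V.shape[0]
    y = np.where(rng.random(n) < rho, 1.0, -1.0)
    E = rng.standard_normal((n, p))  # factors e_2, ..., e_p are normal
    if first_factor == "uniform":
        # Uniform on [-sqrt(3), sqrt(3)] has zero mean and unit variance.
        E[:, 0] = rng.uniform(-np.sqrt(3), np.sqrt(3), size=n)
    elif first_factor == "rademacher":
        E[:, 0] = rng.choice([-1.0, 1.0], size=n)
    X = (y[:, None] * s[None, :] + E) @ V.T  # row i is x_i^T = (y_i s + e_i)^T V^T
    return X, y

def ridge_square_loss_erm(X, y, lam):
    """Closed-form minimizer of (1/n) sum_i (x_i^T b - y_i)^2 / 2 + (lam/2) ||b||^2."""
    n, p = X.shape
    return np.linalg.solve(X.T @ X / n + lam * np.eye(p), X.T @ y / n)

rng = np.random.default_rng(0)
n, p, rho, lam = 600, 200, 0.5, 1.0  # lam is an illustrative choice
s = np.zeros(p)
s[0] = np.sqrt(2)                    # single informative factor, q = 1
V = haar_orthogonal(p, rng)

accs = {}
for kind in ("gaussian", "uniform", "rademacher"):
    X, y = sample_lfmm(n, V, s, rho, rng, kind)
    beta = ridge_square_loss_erm(X, y, lam)
    # Fresh test samples from the same distribution, independent of beta.
    X_test, y_test = sample_lfmm(20000, V, s, rho, rng, kind)
    accs[kind] = float(np.mean(np.sign(X_test @ beta) == y_test))
    print(f"{kind:>10s} e_1: test accuracy = {accs[kind]:.3f}")
```

All three variants share the same class means and covariance and differ only in the law of the informative factor $e_1$, so comparing their predicted scores $\hat{ \boldsymbol{\beta} }^ {\sf T} {\mathbf{x}}'$ is one way to probe empirically whether the distribution matters beyond the first two moments. Swapping the closed-form solver for a numerical minimizer of another loss would be needed to explore the loss dependence discussed in the paper.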

Theorems & Definitions (12)

  • Definition 1: Linear factor mixture model, LFMM
  • Theorem 1: Asymptotic distribution of predicted scores
  • Corollary 1: Asymptotic generalization and training performances
  • Remark 1: On classifier bias under GMM and LFMM
  • Definition 2: Equivalent Gaussian mixture model
  • Definition 3: Gaussian universality under LFMM
  • Corollary 2: Condition of Gaussian universality on in-distribution performance
  • Remark 2: Connection to conditional one-directional CLT in \ref{['eq:clt for CGEP']}
  • Corollary 3: Condition of Gaussian universality on classifier
  • Remark 3: Limitation of square loss
  • ...and 2 more