Gaussian Universality of Perceptrons with Random Labels

Federica Gerace; Florent Krzakala; Bruno Loureiro; Ludovic Stephan; Lenka Zdeborová

TL;DR

The paper addresses whether Gaussian-design results for perceptrons with random labels extend to realistic, high-dimensional data. It develops a rigorous universality framework showing that mixtures of Gaussians with random labels are effectively equivalent to a single Gaussian design with matching covariance in the proportional regime, and that this universality holds for generic convex losses as regularization vanishes, with a strong form for ridge regression. The authors derive exact asymptotic expressions via replica analysis and validate them with extensive numerical experiments on real datasets preprocessed by random features or scattering transforms. The findings illuminate why Gaussian-based theory often captures practical learning behavior and offer a path toward analytically tractable insights for high-dimensional learning on real data, with potential extensions to non-convex losses and deeper networks.

Abstract

While classical in many theoretical settings - and in particular in statistical physics-inspired works - the assumption of Gaussian i.i.d. input data is often perceived as a strong limitation in the context of statistics and machine learning. In this study, we redeem this line of work in the case of generalized linear classification, a.k.a. the perceptron model, with random labels. We argue that there is a large universality class of high-dimensional input data for which we obtain the same minimum training loss as for Gaussian data with corresponding data covariance. In the limit of vanishing regularization, we further demonstrate that the training loss is independent of the data covariance. On the theoretical side, we prove this universality for an arbitrary mixture of homogeneous Gaussian clouds. Empirically, we show that the universality holds also for a broad range of real datasets.
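The covariance independence at vanishing regularization can be checked numerically in the simplest case, the square loss. The sketch below is not the authors' code: it assumes labels drawn uniformly from {-1, +1}, uses the mean squared residual (1/n)||y - Xw||^2 as the training loss (the paper's normalization may differ), and takes the plain least-squares fit as the zero-regularization limit of ridge for n > p. The mixture parameters mirror the Figure 1 caption; the function names are hypothetical. Under these assumptions the expected loss is 1 - p/n for any full-rank design, because the labels are independent of the inputs and the residual is the projection of y onto an (n - p)-dimensional subspace.

```python
import numpy as np

def ridgeless_train_loss(X, y):
    """Mean squared residual of the least-squares fit (n > p, full-rank X)."""
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.mean((y - X @ w) ** 2))

def sample_gaussian(n, p, rng):
    """i.i.d. standard Gaussian inputs (identity covariance)."""
    return rng.standard_normal((n, p))

def sample_mixture(n, p, rng):
    """Two Gaussian clusters with means +/- e_1, identity covariance, equal weights."""
    signs = rng.choice([-1.0, 1.0], size=n)
    X = rng.standard_normal((n, p))
    X[:, 0] += signs
    return X

def average_loss(sampler, n, p, trials, seed):
    """Average the training loss over independent draws of inputs and random labels."""
    rng = np.random.default_rng(seed)
    losses = []
    for _ in range(trials):
        X = sampler(n, p, rng)
        y = rng.choice([-1.0, 1.0], size=n)  # labels independent of the inputs
        losses.append(ridgeless_train_loss(X, y))
    return float(np.mean(losses))

n, p = 400, 100
loss_gauss = average_loss(sample_gaussian, n, p, trials=50, seed=0)
loss_mix = average_loss(sample_mixture, n, p, trials=50, seed=1)
expected = 1.0 - p / n  # exact expectation under the assumptions above
print(loss_gauss, loss_mix, expected)
```

Both averages should sit close to 1 - p/n, up to finite-size fluctuations. The hinge and logistic cases have no such closed form and require the asymptotic formulas derived in the paper.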

Paper Structure (27 sections, 5 theorems, 44 equations, 8 figures, 2 algorithms)

Introduction
Setting, notation, and Asymptotic formulas
The main theoretical results: from mixtures to a single Gaussian
Mean invariance with random labels
Generic loss with vanishing regularisation
Ridge regression with vanishing regularization
Numerical experiments
Experiments with finite regularization
Experiments with vanishing regularization
Homogeneity assumption
A remark on Rademacher complexity
Conclusion
Exact asymptotic performances of GCM and GMM
Preliminaries: the setting
Note on scalings
...and 12 more sections

Key Result

Lemma 1

In the random label setting (eq:y), assume that the loss $\ell$ is symmetric, in the sense that $\ell(x, y) = \ell(-x, -y)$ for $x, y \in \mathbb{R}$. Then the limiting value $\mathcal{E}_{\ell}$ of the risk is independent of the means, i.e. for all choices of $\bm\rho$, $\bm M$ and $\bm\Sigma^\oti
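The symmetry condition $\ell(x, y) = \ell(-x, -y)$ is easy to test for the losses that appear in the figures. The sketch below is illustrative only: it assumes $x$ is the pre-activation and $y$ the label (the argument order is not fixed by the statement above), uses common textbook forms of the square, hinge and logistic losses, and adds a hypothetical one-sided loss as a counterexample. All function names are made up for this example.

```python
import math

def square_loss(x, y):
    return 0.5 * (x - y) ** 2

def hinge_loss(x, y):
    return max(0.0, 1.0 - x * y)

def logistic_loss(x, y):
    return math.log1p(math.exp(-x * y))

def one_sided_loss(x, y):
    # Hypothetical non-symmetric loss: penalises only overshoot x > y.
    return max(0.0, x - y)

def is_symmetric(loss, points, tol=1e-12):
    """Check loss(x, y) == loss(-x, -y) on a finite grid of test points."""
    return all(abs(loss(x, y) - loss(-x, -y)) <= tol for x, y in points)

# Grid of pre-activations in [-3, 3] paired with labels in {-1, +1}.
points = [(x / 4.0, y) for x in range(-12, 13) for y in (-1.0, 1.0)]

for loss in (square_loss, hinge_loss, logistic_loss, one_sided_loss):
    print(loss.__name__, is_symmetric(loss, points))
```

A grid check is only a sanity test, not a proof. For the three standard losses the symmetry follows directly from their dependence on $x - y$ or on the product $xy$.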

Figures (8)

Figure 1: Training loss as a function of the number of samples $n$ per input dimension $p$ at regularization $\lambda = 10^{-15}$. Left panel: square loss; right panel: hinge loss. The black solid line shows the outcome of the replica calculation for i.i.d. Gaussian inputs, i.e. when the covariance matrix $\Sigma$ is the identity matrix. Dots are numerical simulations on different full-rank datasets: blue dots correspond to MNIST with Gaussian random features and error function non-linearity, red dots to fashion-MNIST with wavelet scattering transform, green dots to CIFAR10 in grayscale with Gaussian random features and ReLU non-linearity, and yellow dots to a mixture of Gaussians with means $\bm{\mu}_{\pm} = \left( \pm 1, 0, \dots, 0 \right)$, covariances $\Sigma_{\pm}$ both equal to the identity matrix, and relative class proportions $\rho_{\pm} = 1/2$. Black dots correspond to i.i.d. Gaussian inputs.
Figure 2: Training loss as a function of the number of samples $n$ per dimension $p$ at finite regularization $\lambda$. Top panels: square loss; bottom panels: hinge loss. The first column refers to MNIST with Gaussian random features and error function non-linearity, the second to fashion-MNIST with wavelet scattering transform, the third to CIFAR10 in grayscale with Gaussian random features and ReLU non-linearity, and the fourth to a mixture of Gaussians with means $\bm{\mu}_{\pm} = \left( \pm 1, 0, \dots, 0 \right)$, covariances $\Sigma_{\pm}$ both equal to the identity matrix, and relative class proportions $\rho_{\pm} = 1/2$. Black solid lines show the outcome of the replica calculation, obtained by assigning to $\Sigma$ the covariance matrix of each dataset after the corresponding transformation. The coloured dots are simulations for different values of $\lambda$, as specified in the plot legend. Simulations are averaged over $10$ samples, and the error bars are not visible at the plot scale.
Figure 3: Ridge/square loss (left) and hinge loss (right) for a single Gaussian vs a mixture of inhomogeneous Gaussians at finite $\lambda$. Lines are the exact asymptotic results and dots are simulations ($p = 900$; dark lines for the mixture, lighter ones for the single Gaussian). When the homogeneity assumption is violated, a mixture of two Gaussians does not yield the same results as a single Gaussian with matching covariance. Here, the mixture has zero means and block covariances, with diagonal elements equal to $0.01$, $0.98$ and $0.01$ for the first component and $0.495$, $0.01$ and $0.495$ for the second. Note, however, that universality is restored in the ridge case when $\lambda \to 0$, as stated in Theorem 5. It is also very well obeyed for large enough $\lambda$, and deviations appear small in general.
Figure 4: Numerical simulations of universality. As in Fig. 2, this figure shows the training loss as a function of the number of samples $n$ per dimension $p$ at various values of $\lambda$, for an additional dataset included for completeness: grayscale tiny-ImageNet pre-processed with Gaussian random features and $\tanh$ non-linearity. Left panel: square loss; middle panel: logistic loss; right panel: hinge loss. The coloured dots are numerical simulations, while the black solid lines are the theoretical predictions for a single Gaussian with the corresponding input covariance matrix. The numerical simulations are averaged over $10$ different realizations.
Figure 5: Training loss as a function of the number of samples $n$ per dimension $p$ at finite regularization $\lambda$. Top panels: square loss; bottom panels: hinge loss. The first column refers to MNIST, the second to fashion-MNIST, the third to CIFAR10 in grayscale, and the fourth to tiny ImageNet in grayscale. Black solid lines show the outcome of the replica calculation, obtained by assigning to $\Sigma$ the covariance matrix of each dataset. The coloured dots are simulations for different values of $\lambda$, as specified in the plot legend. Simulations are averaged over $10$ samples, and the error bars are not visible at the plot scale.
...and 3 more figures
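Several captions above refer to datasets pre-processed with Gaussian random features and a pointwise non-linearity, with the resulting covariance plugged into the Gaussian prediction as $\Sigma$. A minimal sketch of such a feature map follows. It is an assumption-laden illustration, not the authors' pipeline: the $1/\sqrt{d}$ scaling of the projection, the centred empirical covariance, the function name `gaussian_random_features`, and the synthetic stand-in for image data are all choices made here.

```python
import math
import numpy as np

def gaussian_random_features(X, p, activation, seed=0):
    """Map rows of X (n x d) to p features: activation(X F / sqrt(d)).

    F is a d x p matrix with i.i.d. standard Gaussian entries; the
    1/sqrt(d) scaling is an assumption of this sketch.
    """
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    F = rng.standard_normal((d, p))
    return activation(X @ F / np.sqrt(d))

# Non-linearities named in the captions.
erf = np.vectorize(math.erf)          # error function, applied elementwise

def relu(z):
    return np.maximum(z, 0.0)

# Synthetic stand-in for flattened, normalised images (n samples, d pixels).
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 50))

Phi_erf = gaussian_random_features(X, 80, erf)
Phi_relu = gaussian_random_features(X, 80, relu)
Phi_tanh = gaussian_random_features(X, 80, np.tanh)

# Empirical covariance of the features: the kind of matrix that would
# play the role of Sigma in a single-Gaussian prediction.
Sigma_hat = np.cov(Phi_erf, rowvar=False)
print(Phi_erf.shape, Sigma_hat.shape)
```

Passing the non-linearity as a function argument keeps the three variants mentioned in the captions (error function, ReLU, tanh) on the same code path, so only the activation changes between experiments.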

Theorems & Definitions (7)

Lemma 1: Single mean lemma for random labels
Theorem 3: Gaussian universality for random labels
Theorem 4: Gaussian universality for vanishing regularization
proof
Theorem 5: Strong universality for ridge loss
Lemma 2
proof
