Learning from higher-order statistics, efficiently: hypothesis tests, random features, and neural networks

Eszter Székely; Lorenzo Bardone; Federica Gerace; Sebastian Goldt

TL;DR

This work studies the fundamental statistical and computational limits of recovering the spike, measured by the number of samples $n$ required to strongly distinguish inputs drawn from the spiked cumulant model from isotropic Gaussian inputs. It derives an exact formula for the likelihood ratio norm, which proves that statistical distinguishability requires $n\gtrsim d$ samples.

Abstract

Neural networks excel at discovering statistical patterns in high-dimensional data sets. In practice, higher-order cumulants, which quantify the non-Gaussian correlations between three or more variables, are particularly important for the performance of neural networks. But how efficient are neural networks at extracting features from higher-order cumulants? We study this question in the spiked cumulant model, where the statistician needs to recover a privileged direction or "spike" from the order-$p\ge 4$ cumulants of $d$-dimensional inputs. Existing literature established the presence of a wide statistical-to-computational gap in this problem. We deepen this line of work by finding an exact formula for the likelihood ratio norm which proves that statistical distinguishability requires $n\gtrsim d$ samples, while distinguishing the two distributions in polynomial time requires $n \gtrsim d^2$ samples for a wide class of algorithms, i.e. those covered by the low-degree conjecture. Numerical experiments show that neural networks do indeed learn to distinguish the two distributions with quadratic sample complexity, while "lazy" methods like random features are not better than random guessing in this regime. Our results show that neural networks extract information from higher-order correlations in the spiked cumulant model efficiently, and reveal a large gap in the amount of data required by neural networks and random features to learn from higher-order cumulants.
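To make the setting concrete, here is a minimal sketch of how inputs with a spike hidden in the higher-order cumulants could be generated. The specific construction is an assumption for illustration and is not spelled out in the text above: a Rademacher spike `u`, a Rademacher latent variable `g`, a signal strength `beta`, and an explicit whitening step that removes the spike from the covariance, so that it survives only in cumulants of order four and higher. The function name `sample_spiked_cumulant` is hypothetical.

```python
import numpy as np

def sample_spiked_cumulant(n, d, beta, rng):
    """Draw n whitened d-dimensional inputs with a non-Gaussian direction u.

    Assumed construction (illustrative only):
        y = sqrt(beta / d) * g * u + z,   z ~ N(0, I_d),  g, u_i Rademacher
        x = S y, with S chosen such that Cov(x) = I_d.
    """
    u = rng.choice([-1.0, 1.0], size=d)      # spike on the hypercube
    g = rng.choice([-1.0, 1.0], size=n)      # non-Gaussian latent variable
    z = rng.standard_normal((n, d))          # isotropic Gaussian noise
    y = np.sqrt(beta / d) * g[:, None] * u[None, :] + z

    # Cov(y) = I + beta * u_hat u_hat^T with u_hat = u / sqrt(d).
    # S = I + c * u_hat u_hat^T whitens y when (1 + c)^2 (1 + beta) = 1.
    u_hat = u / np.sqrt(d)
    c = 1.0 / np.sqrt(1.0 + beta) - 1.0
    x = y + c * (y @ u_hat)[:, None] * u_hat[None, :]
    return x, u

rng = np.random.default_rng(0)
x, u = sample_spiked_cumulant(n=20000, d=20, beta=10.0, rng=rng)

# The covariance is close to the identity, so second-order statistics
# cannot reveal the spike ...
cov_error = np.max(np.abs(x.T @ x / len(x) - np.eye(x.shape[1])))

# ... but the projection onto u has a clearly non-zero excess kurtosis.
s = x @ u / np.sqrt(len(u))
excess_kurtosis = np.mean(s**4) - 3.0 * np.mean(s**2) ** 2
```

Under these assumptions the population excess kurtosis along the spike is $-2\beta^2/(1+\beta)^2$, while every direction orthogonal to it is exactly Gaussian. This is why a test that only looks at means and covariances cannot tell these inputs apart from isotropic Gaussian ones.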

Paper Structure (51 sections, 11 theorems, 133 equations, 8 figures)

Introduction
Further related work
Detecting spikes in high-dimensional data
NGCA, Gaussian pancakes and low-degree polynomials
Separation between neural networks and random features
Reproducibility
The data models
The Gaussian case
The spiked cumulant model
How many samples do we need to learn?
Statistical distinguishability: LR analysis
Computational distinguishability: LDLR analysis
Statistical-to-computational gaps in the spiked cumulant model
Learning from HOCs with neural networks and random features
Spiked Wishart model
...and 36 more sections

Key Result

Theorem 2

Suppose that $u$ has an i.i.d. Rademacher prior and that the non-Gaussian distribution $p_g$ satisfies assumptions:g. Then the norm of the total LR is given by where $f$ is defined as the following average over two independent replicas $g_u,g_v\sim g$ of $g$:

Figures (8)

Figure 1: The performance of an exhaustive-search algorithm corroborates the presence of a phase transition at $\theta=1$, as suggested by \ref{thm:LRspiked_cumulant}. Success rate of an exponential-time search algorithm over all possible spikes in the $d$-hypercube as a function of the exponent $\theta$, which sets the number of samples $n=d^\theta$ used in the log-likelihood test \ref{eq:brute-force_estimator}, in the case $g\sim\mathrm{Radem}(1/2)$.
Figure 2: Learning the spiked Wishart task with neural networks and random features. (A, B) Test accuracy of random features (RF) and early-stopping test accuracy of two-layer ReLU neural networks (NN) on the spiked Wishart task, \ref{eq:wishart}, with linear and quadratic sample complexity ($n_\mathrm{class}\asymp~d,~d^2$, respectively, where $d$ is the input dimension). Predictions for the performance of random features obtained using replicas are shown in black. (C, D) Maximum normalised overlaps of the networks' first-layer weights with the spike $u$, \ref{eq:wishart}. Parameters: $\beta=5$. Neural nets and random features have $m=5d$ hidden neurons. Full experimental details in \ref{app:experimental-methods}.
Figure 3: Learning the spiked cumulant task with neural networks and random features. (A, B) Test accuracy of random features (RF) and early-stopping test accuracy of two-layer ReLU neural networks (NN) on the spiked cumulant task, \ref{eq:whitening}, with linear and quadratic sample complexity ($n_\mathrm{class}\asymp~d,~d^2$, respectively, where $d$ is the input dimension). (C, D) Maximum normalised overlaps of the networks' first-layer weights with the spike $u$, \ref{eq:whitening}. Parameters: $\beta=10$. Neural nets and random features have $m=5d$ hidden neurons, with the same optimisation as in \ref{fig:spiked_wishart}. Full experimental details in \ref{app:experimental-methods}.
Figure 4: A phase transition in the fourth-order cumulant precedes learning from the fourth cumulant. (A) We train neural networks to discriminate inputs sampled from a simple non-Gaussian model for images introduced by ingrosso2022data (top) from Gaussians with the same mean and covariance (bottom). (B) Test error of two-layer neural networks interpolating between the fully-trained ($\alpha=1$) and lazy (large $\alpha$) regimes, see \ref{sec:experiments-nlgp}. (C) The localisation of the leading CP-factor of the non-Gaussian inputs (dashed purple line) and of the first-layer weights of the trained networks, as measured by the inverse participation ratio (IPR), \ref{eq:ipr}. A large IPR denotes a more localised vector $w$. Parameters: $g=3$, $\xi=1$, $d=20$, $m=100$. Full details in \ref{app:experimental-methods}.
Figure 5: Learning the spiked Wishart and spiked cumulant tasks, starting from small initial overlaps. We repeat the neural network experiments on the spiked Wishart (top) and spiked cumulant (bottom) tasks, see \ref{fig:spiked_wishart,fig:spiked_cumulant}, while enforcing that all hidden neurons have an overlap of exactly $1/\sqrt d$ with the spikes, by simple explicit orthogonalisation. While the maximum overlaps do indeed decrease for small sample complexities, the qualitative behaviour is unchanged. All hyper-parameters as in \ref{fig:spiked_wishart,fig:spiked_cumulant}, respectively.
...and 3 more figures
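
The captions above rely on two diagnostics: the maximum normalised overlap of first-layer weights with the spike $u$ (Figures 2, 3 and 5) and the inverse participation ratio of a weight vector $w$ (Figure 4). The sketch below shows one way to compute them. It assumes the standard definitions, $\mathrm{IPR}(w)=\sum_i w_i^4/(\sum_i w_i^2)^2$ and the largest absolute cosine similarity between a hidden neuron's weights and $u$; the exact normalisations used in the paper's equations may differ, and both function names are hypothetical.

```python
import numpy as np

def inverse_participation_ratio(w):
    """IPR(w) = sum_i w_i^4 / (sum_i w_i^2)^2 (assumed standard definition).

    Equals 1 for a vector with a single non-zero entry and 1/d for a
    vector whose d entries all have the same magnitude.
    """
    w = np.asarray(w, dtype=float)
    return np.sum(w**4) / np.sum(w**2) ** 2

def max_normalised_overlap(W, u):
    """Largest |cosine similarity| between the rows of W and the spike u.

    W has shape (m, d), one row per hidden neuron; u has shape (d,).
    Taking the absolute value is an assumption made for this sketch.
    """
    W = np.asarray(W, dtype=float)
    u = np.asarray(u, dtype=float)
    cosines = (W @ u) / (np.linalg.norm(W, axis=1) * np.linalg.norm(u))
    return np.max(np.abs(cosines))

# Usage: random weights have an overlap of order 1/sqrt(d) with the spike.
rng = np.random.default_rng(0)
d, m = 100, 500
u = rng.choice([-1.0, 1.0], size=d)
W = rng.standard_normal((m, d))
overlap_at_init = max_normalised_overlap(W, u)
ipr_at_init = inverse_participation_ratio(W[0])
```

Both quantities are invariant to rescaling of the weights, so they can be compared across networks with different weight norms.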

Theorems & Definitions (23)

Theorem 2
Definition 3: Low-degree likelihood ratio (LDLR)
Conjecture 4
Theorem 5: LDLR for spiked cumulant model
Proposition 6: Second Moment Method for Distinguishability
proof
Example 7
proof
Theorem 8: LDLR for spiked Wishart model
proof
...and 13 more
