Sliding down the stairs: how correlated latent variables accelerate learning with neural networks

Lorenzo Bardone, Sebastian Goldt

TL;DR

It is shown that correlations between latent variables along the directions encoded in different input cumulants speed up learning from higher-order correlations.

Abstract

Neural networks extract features from data using stochastic gradient descent (SGD). In particular, higher-order input cumulants (HOCs) are crucial for their performance. However, extracting information from the $p$th cumulant of $d$-dimensional inputs is computationally hard: the number of samples required to recover a single direction from an order-$p$ tensor (tensor PCA) using online SGD grows as $d^{p-1}$, which is prohibitive for high-dimensional inputs. This result raises the question of how neural networks extract relevant directions from the HOCs of their inputs efficiently. Here, we show that correlations between latent variables along the directions encoded in different input cumulants speed up learning from higher-order correlations. We show this effect analytically by deriving nearly sharp thresholds for the number of samples required by a single neuron to weakly-recover these directions using online SGD from a random start in high dimensions. Our analytical results are confirmed in simulations of two-layer neural networks and unveil a new mechanism for hierarchical learning in neural networks.
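To give a sense of scale for the $d^{p-1}$ scaling quoted above, the following minimal sketch evaluates the leading-order sample count for a few cumulant orders at $d=128$, the input dimension used in the figures. The function name `tensor_pca_sgd_samples` is hypothetical, and constants and logarithmic factors are ignored, so the numbers are orders of magnitude rather than precise thresholds.

```python
def tensor_pca_sgd_samples(d: int, p: int) -> int:
    """Leading-order scaling d**(p-1) of the number of samples that online SGD
    needs to recover one direction from an order-p tensor.

    Illustrative helper: constants and log factors are deliberately omitted.
    """
    return d ** (p - 1)


if __name__ == "__main__":
    d = 128  # input dimension used in the paper's figures
    for p in (2, 3, 4):
        print(f"p={p}: ~{tensor_pca_sgd_samples(d, p):,} samples")
```

For $d=128$, moving from the covariance ($p=2$) to the fourth cumulant ($p=4$) raises the leading-order count from $128$ to $128^3 = 2{,}097{,}152$, which is the gap that correlated latent variables are argued to close.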

Paper Structure (41 sections, 6 theorems, 66 equations, 2 figures, 1 table)

Key Result

Proposition 1

Considering only the spike in the covariance, the generative distribution is $y \sim \text{Rademacher}(1/2)$. A spherical perceptron that satisfies \ref{assumptions:sigma} and is trained on the correlation loss \ref{eq:corrloss} using online SGD \ref{eq:onlineSGD} obeys the following result for the overlap with the hidden direction, $\alpha_{u,t} := u \cdot w_t$:
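The setting of this proposition can be illustrated with a minimal simulation sketch. The summary above does not reproduce the data model, the correlation loss, or the SGD update, so everything specific below is an assumption made for illustration: the inputs are taken to be $x = z + \sqrt{\beta}\, g\, u$ for $y=+1$ and $x = z$ for $y=-1$, the loss is taken to be $-y\,\sigma(w \cdot x)$ with $\sigma(s) = s^2 - 1$, and the update is a projected gradient step followed by renormalisation. The function name `online_sgd_sphere` and all default parameters are hypothetical.

```python
import numpy as np


def online_sgd_sphere(d=128, beta=5.0, lr=0.002, steps=2000, seed=0):
    """Projected online SGD for a single neuron constrained to the unit sphere.

    Assumed data model (illustrative, not taken from the source):
      y = +1 or -1 with probability 1/2,
      x = z + sqrt(beta) * g * u  if y = +1,  x = z  if y = -1,
      with z ~ N(0, I_d) and g ~ N(0, 1).
    Assumed loss: -y * sigma(w . x) with sigma(s) = s**2 - 1.
    Returns the final weight and the overlap u . w_t after every step.
    """
    rng = np.random.default_rng(seed)
    u = np.zeros(d)
    u[0] = 1.0                                  # hidden direction (unit norm)
    w = rng.standard_normal(d)
    w /= np.linalg.norm(w)                      # random start on the sphere
    overlaps = np.empty(steps + 1)
    overlaps[0] = u @ w
    for t in range(steps):
        y = rng.choice([-1.0, 1.0])
        x = rng.standard_normal(d)
        if y > 0:
            x += np.sqrt(beta) * rng.standard_normal() * u
        s = w @ x
        grad = -y * 2.0 * s * x                 # gradient of -y * (s**2 - 1) in w
        grad -= (grad @ w) * w                  # project onto the tangent space at w
        w = w - lr * grad                       # one fresh sample per step (online SGD)
        w /= np.linalg.norm(w)                  # retract back onto the sphere
        overlaps[t + 1] = u @ w
    return w, overlaps


if __name__ == "__main__":
    w, overlaps = online_sgd_sphere()
    print("initial overlap:", overlaps[0])
    print("final overlap:  ", overlaps[-1])
```

Under these assumptions the expected gradient step is proportional to $\beta\,(u \cdot w)\,u$, so the overlap is pushed away from zero; whether and when it escapes the random-start scale depends on the step size and the number of samples, which is what the proposition quantifies.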

Figures (2)

  • Figure 1: Correlated latent variables speed up learning of neural networks. (A) Test error of a two-layer neural network trained on the mixed cumulant model (MCM) of \ref{eq:mixed-cumulant-model} with signal-to-noise ratios $\beta_m=1, \beta_u=5, \beta_v = 10$. The MCM is a binary classification task where the inputs in the two classes have a different mean, a different covariance, and different higher-order cumulants (HOCs). We show the test error on the full data set (red) and on several "censored" data sets: a test set where only the mean of the inputs is different in each class (blue, $\beta_m=1, \beta_u=\beta_v=0$), a test set where mean and covariance are different (green, $\beta_m=1, \beta_u = 5, \beta_v=0$), and a Gaussian mixture that is fitted to the true data set (orange). The neural networks learn distributions of increasing complexity: initially, only the difference in means matters, as the blue and red curves coincide; later, the network learns about differences at the level of the covariance, and finally at the level of higher-order cumulants. (B) Test loss of a two-layer neural network trained on CIFAR10 and evaluated on CIFAR10 (red), a Gaussian mixture with the means fitted to CIFAR10 (blue), and a Gaussian mixture with the means and covariance fitted to CIFAR10 (orange). (C) Same setup as in (A), but here the latent variables corresponding to the covariance and the cumulants of the inputs are correlated, leading to a significant speed-up of learning from HOCs (the red and orange lines separate after $\gtrsim 10^{4}$ steps, rather than $\gtrsim 10^6$ steps). Parameters: $\beta_m=1, \beta_u=5, \beta_v=10, d=128, m=512$ hidden neurons, ReLU activation function. Full details in \ref{app:figure-details}.
  • Figure 2: Staircases in the teacher-student setup. (A) Test accuracy of the same two-layer neural networks as in \ref{fig:figure1}, evaluated on the degree-4 target function $y^*(x)$ of \ref{eq:teacher} during training on the target functions $y^{(1)}(x) = h_1(m \cdot x )$ (blue), $y^{(2)}(x) = h_1(m \cdot x) + h_2(u \cdot x)$ (green), and the teacher function \ref{eq:teacher} (red). Inputs are drawn from the standard multivariate Gaussian distribution. (B-D) We show the average of the top-5 largest normalised overlaps $w_k \cdot u$ between the weights $w_k$ of the $k$th hidden neuron and the three directions that need to be learnt, for three different target functions: the teacher function in \ref{eq:teacher} (B), the same teacher with inputs that have a covariance $\mathbbm{1} + u v^\top+vu^{\top}$ (C), and a teacher with mixed terms, \ref{eq:teacher-mixed} (D). The dashed black line is at $d^{-1/2}$, the threshold for weak recovery. Parameters as in \ref{fig:figure1}: $d=128, m=512$ hidden neurons, ReLU activation function. Full details in \ref{app:figure-details}.
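
The overlap statistic described in the Figure 2 caption can be sketched as follows. This is an illustrative reading of the caption, not the authors' code: the function names are hypothetical, overlaps are taken as signed cosine similarities between each hidden neuron's weight vector and the target direction, and using absolute values instead would be an equally plausible variant.

```python
import numpy as np


def top_k_mean_overlap(W, u, k=5):
    """Average of the k largest normalised overlaps between the rows of W and u.

    W: array of shape (m, d), one weight vector per hidden neuron.
    u: array of shape (d,), the direction to be learnt.
    Overlaps are signed cosine similarities (an assumption; see the lead-in).
    """
    W_hat = W / np.linalg.norm(W, axis=1, keepdims=True)
    u_hat = u / np.linalg.norm(u)
    overlaps = W_hat @ u_hat
    return float(np.sort(overlaps)[-k:].mean())


def weak_recovery_threshold(d):
    """The d**(-1/2) reference line: the typical overlap scale of a random vector."""
    return d ** -0.5


if __name__ == "__main__":
    d, m = 128, 512                      # dimensions quoted in the figure captions
    rng = np.random.default_rng(0)
    W = rng.standard_normal((m, d))      # untrained (random) weights
    u = rng.standard_normal(d)
    print("top-5 mean overlap:", top_k_mean_overlap(W, u))
    print("threshold d^(-1/2):", weak_recovery_threshold(d))
```

Because the statistic takes the largest few of $m$ overlaps, it sits somewhat above $d^{-1/2}$ even for random weights, so the informative signal in such a plot is a sustained rise well beyond the dashed line.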

Theorems & Definitions (8)

  • Proposition 1: Covariance spike only
  • Proposition 2: Cumulant spike only
  • Proposition 3
  • Proposition 4
  • Remark 5
  • Lemma 6
  • Lemma 7
  • proof