Beyond IID weights: sparse and low-rank deep Neural Networks are also Gaussian Processes

Thiziri Nait-Saada, Alireza Naderi, Jared Tanner

TL;DR

The seminal proof of Matthews et al. (2018) is extended to a larger class of initial weight distributions, including the established cases of IID and orthogonal weights, as well as the emerging low-rank and structured sparse settings celebrated for their computational speed-up benefits.

Abstract

The infinitely wide neural network has been proven a useful and manageable mathematical model that enables the understanding of many phenomena appearing in deep learning. One example is the convergence of random deep networks to Gaussian processes that allows a rigorous analysis of the way the choice of activation function and network weights impacts the training dynamics. In this paper, we extend the seminal proof of Matthews et al. (2018) to a larger class of initial weight distributions (which we call PSEUDO-IID), including the established cases of IID and orthogonal weights, as well as the emerging low-rank and structured sparse settings celebrated for their computational speed-up benefits. We show that fully-connected and convolutional networks initialized with PSEUDO-IID distributions are all effectively equivalent up to their variance. Using our results, one can identify the Edge-of-Chaos for a broader class of neural networks and tune them at criticality in order to enhance their training. Moreover, they enable the posterior distribution of Bayesian Neural Networks to be tractable across these various initialization schemes.
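The Edge-of-Chaos tuning mentioned in the abstract can be illustrated with a small numerical sketch. It is not taken from the paper: it assumes a tanh activation, weight variance $\sigma_W^2 / n$ per entry, bias variance $\sigma_b^2$, and the standard mean-field criterion from the signal-propagation literature, namely $\chi = \sigma_W^2 \, \mathbb{E}[\phi'(\sqrt{q^*} z)^2] = 1$ at the fixed point $q^*$ of the variance map. The function names `gauss_mean`, `fixed_point_variance` and `chi` are hypothetical.

```python
import numpy as np

# Gauss-Hermite nodes/weights for the weight function exp(-x^2 / 2),
# normalised so that the weights sum to 1 (expectation under N(0, 1)).
_NODES, _WEIGHTS = np.polynomial.hermite_e.hermegauss(101)
_WEIGHTS = _WEIGHTS / np.sqrt(2.0 * np.pi)


def gauss_mean(f, q):
    """Approximate E[f(sqrt(q) * z)] for z ~ N(0, 1) by quadrature."""
    return float(np.sum(_WEIGHTS * f(np.sqrt(q) * _NODES)))


def fixed_point_variance(sigma_w, sigma_b, iters=500, q0=1.0):
    """Iterate the variance map q -> sigma_b^2 + sigma_w^2 E[tanh(sqrt(q) z)^2]."""
    q = q0
    for _ in range(iters):
        q = sigma_b**2 + sigma_w**2 * gauss_mean(lambda u: np.tanh(u) ** 2, q)
    return q


def chi(sigma_w, sigma_b):
    """Mean-field slope sigma_w^2 E[tanh'(sqrt(q*) z)^2] at the fixed point q*.

    Assumption (not from the source): chi == 1 marks the Edge-of-Chaos.
    """
    q_star = fixed_point_variance(sigma_w, sigma_b)
    return sigma_w**2 * gauss_mean(lambda u: (1.0 - np.tanh(u) ** 2) ** 2, q_star)


if __name__ == "__main__":
    for sigma_w in (0.5, 1.0, 2.0):
        print(f"sigma_w={sigma_w:.1f}  chi={chi(sigma_w, 0.0):.3f}")
```

Under these assumptions and with zero bias variance, `chi` is below 1 for $\sigma_W = 0.5$ and above 1 for $\sigma_W = 2$; a root-finder over `sigma_w` could locate the critical value for other choices of `sigma_b`.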

Paper Structure (44 sections, 8 theorems, 67 equations, 9 figures)

Key Result

Theorem 1

Suppose a fully connected neural network as in Eq. (eq:h_simultaneous_limit) is under the Pseudo-iid regime with parameter $\sigma_W^2$, and the activation satisfies the linear envelope property (Def. def:activation_function). Let $\mathcal{X}$ be a countably-infinite set of inputs. Then, for every […] where […]

Figures (9)

  • Figure 1: There are various ways to compute convolutions $\mathbf{U} \star \mathbf{X}$ between a tensor filter $\mathbf{U}$ and a 2D signal $\mathbf{X}$ (in the middle) from matrix multiplications. We illustrate the approach taken in Garriga-Alonso et al. (2019) on the left, where the reshaping procedure is applied to the filter, whilst the method we followed, shown on the right, consists of reshaping the signal instead in order to define special structures on the CNN filters such as orthogonality, sparsity and low-rank.
  • Figure 2: For different instances of the Pseudo-iid regime, in the limit, the preactivation of the first neuron at the fifth layer tends to a Gaussian whose moments are given by Theorem 1. The experiments were conducted $10000$ times on a random $7$-layer deep fully connected network with input data sampled from $\mathbb{S}^8$.
  • Figure 3: The empirical joint distribution of the preactivations generated by two distinct inputs flowing through the network. The large-width limiting distribution as defined in Theorem 1 is included as level curves. The input data $x_a, x_b$ are drawn iid from $\mathbb{S}^9$ and $10000$ experiments were conducted on a $7$-layer fully connected network. The horizontal and vertical axes in each subplot are respectively $h_1^{(5)}(x_a)$ and $h_1^{(5)}(x_b)$.
  • Figure 4: Q-Q plots of the preactivation values in Fig. 2, as an alternative way of showing the convergence of the preactivation of a fully connected network to a Gaussian as fully characterized in Theorem 1. The settings of the experiment are the same as those in Fig. 2.
  • Figure 5: For fully connected networks in the Pseudo-iid regime, Theorem 1 shows that in the large-width limit, at any layer, two neurons fed with the same input data become independent. We compare the joint distribution of the preactivations of the first and second neurons at the fifth layer with an isotropic Gaussian probability density function. Initializing the weight matrices with iid Cauchy realisations falls outside of our defined framework, resulting in a poor match. The inputs were sampled from $\mathbb{S}^8$ and $10000$ experiments were conducted on a $7$-layer network.
  • ...and 4 more figures
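
The experiment described in the Figure 2 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' code: it assumes a tanh activation, no biases, entry variance $\sigma_W^2 / n$ with $\sigma_W = 1.5$, width 128, and 400 trials instead of $10000$ to keep the run short. The helper names (`gaussian_layer`, `orthogonal_layer`, `sample_preactivations`, `limiting_variance`) are hypothetical, and the reference variance comes from the standard infinite-width kernel recursion for IID weights rather than from the truncated statement of Theorem 1 above.

```python
import numpy as np


def gaussian_layer(rng, n_out, n_in, sigma_w):
    """IID Gaussian weights with entry variance sigma_w^2 / n_in."""
    return rng.normal(0.0, sigma_w / np.sqrt(n_in), size=(n_out, n_in))


def orthogonal_layer(rng, n_out, n_in, sigma_w):
    """Scaled Haar-orthogonal weights for square layers (Gaussian otherwise)."""
    if n_out != n_in:
        return gaussian_layer(rng, n_out, n_in, sigma_w)
    q, r = np.linalg.qr(rng.normal(size=(n_in, n_in)))
    q = q * np.sign(np.diag(r))  # sign fix so that q is Haar distributed
    return sigma_w * q


def sample_preactivations(x, sampler, trials=400, width=128, layer=5,
                          sigma_w=1.5, seed=0):
    """Draw `trials` independent networks and record h_1^{(layer)}(x) for each."""
    rng = np.random.default_rng(seed)
    out = np.empty(trials)
    for t in range(trials):
        z = x
        for _ in range(layer):
            h = sampler(rng, width, z.shape[0], sigma_w) @ z  # preactivation
            z = np.tanh(h)                                    # activation
        out[t] = h[0]  # first neuron of the requested layer
    return out


def limiting_variance(x, layer=5, sigma_w=1.5, n_nodes=101):
    """Infinite-width variance of h_1^{(layer)}(x) via the kernel recursion."""
    nodes, weights = np.polynomial.hermite_e.hermegauss(n_nodes)
    weights = weights / np.sqrt(2.0 * np.pi)  # expectation under N(0, 1)
    q = sigma_w**2 * np.dot(x, x) / x.shape[0]
    for _ in range(layer - 1):
        q = sigma_w**2 * np.sum(weights * np.tanh(np.sqrt(q) * nodes) ** 2)
    return float(q)


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    x = rng.normal(size=9)
    x /= np.linalg.norm(x)  # a point on the unit sphere S^8
    print("limiting variance:", limiting_variance(x))
    for sampler in (gaussian_layer, orthogonal_layer):
        h = sample_preactivations(x, sampler)
        print(sampler.__name__, "mean:", h.mean(), "variance:", h.var())
```

Only the first five layers are simulated, since the recorded quantity is the preactivation at the fifth layer. Passing a different `sampler` is one way to compare weight distributions; a histogram or Q-Q plot of the returned samples would mirror the comparisons described for Figures 2 and 4.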

Theorems & Definitions (14)

  • Definition 1: Exchangeability
  • Definition 2: Pseudo-iid
  • Definition 3
  • Theorem 1: GP limit for fully connected Pseudo-iid networks
  • Definition 4: Pseudo-iid for CNNs
  • Theorem 2: GP limit for CNN Pseudo-iid networks
  • Theorem 3: Matthews et al. (2018), Lemma 10
  • Lemma 1: Hölder's inequality
  • Lemma 2: Billingsley's theorem
  • Lemma 3: Sufficient condition to uniformly bound the expectation of a four-cross product
  • ...and 4 more