Beyond IID weights: sparse and low-rank deep Neural Networks are also Gaussian Processes

Thiziri Nait-Saada; Alireza Naderi; Jared Tanner

TL;DR

The seminal proof of Matthews et al. (2018) is extended to a larger class of initial weight distributions, including the established cases of IID and orthogonal weights, as well as the emerging low-rank and structured sparse settings celebrated for their computational speed-up benefits.

Abstract

The infinitely wide neural network has proven to be a useful and manageable mathematical model that enables the understanding of many phenomena appearing in deep learning. One example is the convergence of random deep networks to Gaussian processes, which allows a rigorous analysis of how the choice of activation function and network weights impacts the training dynamics. In this paper, we extend the seminal proof of Matthews et al. (2018) to a larger class of initial weight distributions (which we call Pseudo-iid), including the established cases of IID and orthogonal weights, as well as the emerging low-rank and structured sparse settings celebrated for their computational speed-up benefits. We show that fully-connected and convolutional networks initialized with Pseudo-iid distributions are all effectively equivalent up to their variance. Using our results, one can identify the Edge-of-Chaos for a broader class of neural networks and tune them at criticality in order to enhance their training. Moreover, they enable the posterior distribution of Bayesian Neural Networks to be tractable across these various initialization schemes.
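The claim that differently structured initializations behave alike up to their variance can be probed numerically. The sketch below is only an illustration under stated assumptions: the low-rank construction (a product of two iid Gaussian factors), the fixed rank of 32, the tanh activation, the absence of biases, and the helper names `iid_weights`, `low_rank_weights`, `sample_preactivation` and `excess_kurtosis` are all hypothetical choices, not the paper's definitions, and the code does not check the Pseudo-iid conditions themselves. It draws many random networks and compares the variance and excess kurtosis of one preactivation under dense IID and low-rank weights.

```python
import numpy as np


def iid_weights(rng, n_out, n_in, sigma_w=1.0):
    """Dense Gaussian weights with entry variance sigma_w**2 / n_in."""
    return sigma_w * rng.standard_normal((n_out, n_in)) / np.sqrt(n_in)


def low_rank_weights(rng, n_out, n_in, rank=32, sigma_w=1.0):
    """Hypothetical low-rank init: W = A @ B with iid Gaussian factors.

    The scaling keeps the variance of each entry at sigma_w**2 / n_in,
    matching the dense case.
    """
    a = rng.standard_normal((n_out, rank))
    b = rng.standard_normal((rank, n_in))
    return sigma_w * (a @ b) / np.sqrt(rank * n_in)


def sample_preactivation(rng, sampler, x, width=100, depth=3, trials=1000):
    """Draw `trials` fresh networks and record neuron 0 at the last layer."""
    out = np.empty(trials)
    for t in range(trials):
        h = sampler(rng, width, x.shape[0]) @ x
        for _ in range(depth - 1):
            # tanh is an assumed activation; no bias terms are used.
            h = sampler(rng, width, width) @ np.tanh(h)
        out[t] = h[0]
    return out


def excess_kurtosis(samples):
    """Sample excess kurtosis; it is 0 for an exact Gaussian."""
    z = (samples - samples.mean()) / samples.std()
    return float(np.mean(z ** 4) - 3.0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal(9)
    x *= np.sqrt(x.size) / np.linalg.norm(x)  # rescale so that |x|^2 = 9
    for name, sampler in [("iid", iid_weights), ("low-rank", low_rank_weights)]:
        s = sample_preactivation(rng, sampler, x)
        print(name, "variance:", s.var(), "excess kurtosis:", excess_kurtosis(s))
```

At these small widths the two variances should be close and both kurtosis estimates should be near zero, up to sampling noise and finite-width effects; the comparison is a sanity check, not a substitute for the proof.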

Paper Structure (44 sections, 8 theorems, 67 equations, 9 figures)

Introduction
Why studying low-rank and structured sparse networks at initialization?
Related work
Organization of the paper
Gaussian Process behaviour in the Pseudo-iid regime
The Pseudo-iid regime for fully connected networks
The Pseudo-iid regime for Convolution Neural Networks
Pseudo-iid in practice
Examples of Pseudo-iid distributions
Low-rank weights.
Structured sparse weights.
Orthogonal CNN filters.
Simulations of the Gaussian Processes in Theorem 1 for fully connected networks with Pseudo-iid weights
Implications of the Gaussian Process Limit
Bayesian Neural Network and Gaussian Process.
...and 29 more sections

Key Result

Theorem 1

Suppose a fully connected neural network as in Eq. (eq:h_simultaneous_limit) is in the Pseudo-iid regime with parameter $\sigma_W^2$, and the activation satisfies the linear envelope property (Def. def:activation_function). Let $\mathcal{X}$ be a countably infinite set of inputs. Then, for every … where …

Figures (9)

Figure 1: There are various ways to compute convolutions $\mathbf{U} \star \mathbf{X}$ between a tensor filter $\mathbf{U}$ and a 2D signal $\mathbf{X}$ (in the middle) from matrix multiplications. We illustrate the approach taken in Garriga-Alonso et al. (2019) on the left, where the reshaping procedure is applied to the filter, whilst the method we followed, shown on the right, reshapes the signal instead, in order to define special structures on the CNN filters such as orthogonality, sparsity and low rank.
Figure 2: For different instances of the Pseudo-iid regime, the preactivation of the first neuron at the fifth layer tends, in the large-width limit, to a Gaussian whose moments are given by Theorem 1. The experiments were conducted $10000$ times on a random $7$-layer fully connected network with input data sampled from $\mathbb{S}^8$.
Figure 3: The empirical joint distribution of the preactivations generated by two distinct inputs flowing through the network. The large-width limiting distribution as defined in Theorem 1 is included as level curves. The input data $x_a, x_b$ are drawn iid from $\mathbb{S}^9$ and $10000$ experiments were conducted on a $7$-layer fully connected network. The horizontal and vertical axes in each subplot are respectively $h_1^{(5)}(x_a)$ and $h_1^{(5)}(x_b)$.
Figure 4: Q-Q plots of the preactivation values in Fig. 2, as an alternative way of showing the convergence of the preactivation of a fully connected network to a Gaussian, as fully characterized in Theorem 1. The settings of the experiment are the same as those in Fig. 2.
Figure 5: For fully connected networks in the Pseudo-iid regime, Theorem 1 shows that in the large-width limit, at any layer, two neurons fed with the same input data become independent. We compare the joint distribution of the preactivations of the first and second neurons at the fifth layer with an isotropic Gaussian probability density function. Initializing the weight matrices with iid Cauchy realisations falls outside of our defined framework, resulting in a poor match. The inputs were sampled from $\mathbb{S}^8$ and $10000$ experiments were conducted on a $7$-layer network.
...and 4 more figures
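The kind of experiment described in Figure 3 can be sketched at a much smaller scale. The code below is a hedged illustration, not the authors' setup: the block-diagonal mask with random row and column permutations is one hypothetical way to build structured sparse weights, the first layer is kept dense for simplicity, the activation is assumed to be tanh, biases are omitted, and the width, depth and trial counts are far smaller than in the captions. It estimates the $2 \times 2$ covariance of one neuron's preactivations at two inputs and compares the dense IID and sparse cases.

```python
import numpy as np


def iid_weights(rng, n_out, n_in, sigma_w=1.0):
    """Dense Gaussian weights with entry variance sigma_w**2 / n_in."""
    return sigma_w * rng.standard_normal((n_out, n_in)) / np.sqrt(n_in)


def block_sparse_weights(rng, n_out, n_in, n_blocks=4, sigma_w=1.0):
    """Hypothetical structured sparse init: a permuted block-diagonal mask.

    Nonzero entries have variance sigma_w**2 * n_blocks / n_in, so each row
    carries the same total variance as a dense row.
    """
    row_block = np.arange(n_out) * n_blocks // n_out
    col_block = np.arange(n_in) * n_blocks // n_in
    mask = row_block[:, None] == col_block[None, :]
    # Shuffle rows and columns so the block pattern is not tied to indices.
    mask = mask[rng.permutation(n_out)][:, rng.permutation(n_in)]
    w = rng.standard_normal((n_out, n_in)) * mask
    return sigma_w * w * np.sqrt(n_blocks / n_in)


def joint_samples(rng, sampler, xs, width=64, depth=3, trials=2000):
    """xs holds two inputs as columns; returns neuron 0 at both inputs."""
    out = np.empty((trials, xs.shape[1]))
    for t in range(trials):
        # Assumption: the first layer stays dense; only hidden layers
        # use the structured sampler.
        h = iid_weights(rng, width, xs.shape[0]) @ xs
        for _ in range(depth - 1):
            h = sampler(rng, width, width) @ np.tanh(h)
        out[t] = h[0]
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    xs = rng.standard_normal((10, 2))
    xs *= np.sqrt(10) / np.linalg.norm(xs, axis=0)  # two inputs in R^10
    for name, sampler in [("iid", iid_weights), ("sparse", block_sparse_weights)]:
        cov = np.cov(joint_samples(rng, sampler, xs), rowvar=False)
        print(name, "covariance of (h(x_a), h(x_b)):")
        print(cov)
```

If the two initializations share the same limiting Gaussian process, the two printed covariance matrices should agree up to sampling noise and finite-width effects.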

Theorems & Definitions (14)

Definition 1: Exchangeability
Definition 2: Pseudo-iid
Definition 3
Theorem 1: GP limit for fully connected Pseudo-iid networks
Definition 4: Pseudo-iid for CNNs
Theorem 2: GP limit for CNN Pseudo-iid networks
Theorem 3: Matthews et al. (2018), Lemma 10
Lemma 1: Hölder's inequality
Lemma 2: Billingsley's theorem
Lemma 3: Sufficient condition to uniformly bound the expectation of a four-cross product
...and 4 more
