A Rainbow in Deep Network Black Boxes

Florentin Guth; Brice Ménard; Gaspar Rochette; Stéphane Mallat

TL;DR

Rainbow networks are proved to define deterministic (hierarchical) kernels in the infinite-width limit; the functions they compute belong to a data-dependent RKHS that does not depend on the weight randomness.

Abstract

A central question in deep learning is to understand the functions learned by deep networks. What is their approximation class? Do the learned weights and representations depend on initialization? Previous empirical work has shown that kernels defined by network activations are similar across initializations. For shallow networks, this has been theoretically studied with random feature models, but an extension to deep networks has remained elusive. Here, we provide a deep extension of such random feature models, which we call the rainbow model. We prove that rainbow networks define deterministic (hierarchical) kernels in the infinite-width limit. The resulting functions thus belong to a data-dependent RKHS which does not depend on the weight randomness. We also numerically verify our modeling assumptions on deep CNNs trained on image classification tasks, and show that the trained networks approximately satisfy the rainbow hypothesis. In particular, rainbow networks sampled from the corresponding random feature model achieve performance similar to that of the trained networks. Our results highlight the central role played by the covariances of network weights at each layer, which are observed to be low-rank as a result of feature learning.

Paper Structure (48 sections, 7 theorems, 117 equations, 10 figures, 1 table)

Introduction
Related work
Lazy versus feature learning.
Random features and neural network Gaussian processes.
Mean-field models.
Representation alignment and hierarchical kernels.
Rainbow networks
Rotations in random feature maps
Random feature network.
Kernel convergence.
Rotational alignment.
Deep rainbow networks
Infinite-width rainbow networks.
Dimensionality reduction.
Gaussian rainbow networks.
...and 33 more sections

Key Result

Theorem 1

Assume that $\mathbb{E}_x[ \lVert x \rVert^2 ] < +\infty$, $\sigma$ is Lipschitz continuous, and $\pi$ has finite fourth order moments. Then there exists a constant $c > 0$ which does not depend on $d_0$ nor $d_1$ such that where $x'$ is an i.i.d. copy of $x$. Suppose that the sorted eigenvalues $\lambda_{1} \geq \cdots \geq \lambda_m \geq \cdots$ of ${\mathbb E}_x[\varphi(x
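The convergence that Theorem 1 describes can be illustrated numerically for a single random feature layer. The sketch below is not the paper's experiment: it assumes a ReLU non-linearity for $\sigma$ and a standard Gaussian for the weight distribution $\pi$ (both satisfy the stated Lipschitz and moment conditions), and it assumes the limit kernel is $k(x, x') = \mathbb{E}_w[\sigma(\langle w, x \rangle)\,\sigma(\langle w, x' \rangle)]$, which for this choice has the known arc-cosine closed form. The function names `empirical_kernel` and `limit_kernel` are hypothetical.

```python
import numpy as np


def empirical_kernel(x, x_prime, width, rng):
    """Inner product of ReLU random features built from `width` Gaussian weights."""
    # Assumption: rows of W are i.i.d. N(0, I), standing in for the distribution pi.
    W = rng.standard_normal((width, x.shape[0]))
    phi = np.maximum(W @ x, 0.0) / np.sqrt(width)
    phi_prime = np.maximum(W @ x_prime, 0.0) / np.sqrt(width)
    return float(phi @ phi_prime)


def limit_kernel(x, x_prime):
    """Closed form of E_w[relu(<w, x>) relu(<w, x'>)] for w ~ N(0, I)."""
    nx, nxp = np.linalg.norm(x), np.linalg.norm(x_prime)
    cos_theta = np.clip(x @ x_prime / (nx * nxp), -1.0, 1.0)
    theta = np.arccos(cos_theta)
    # Arc-cosine kernel of order 1, halved because only one ReLU branch is active.
    return float(nx * nxp * (np.sin(theta) + (np.pi - theta) * cos_theta) / (2 * np.pi))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x, x_prime = rng.standard_normal(16), rng.standard_normal(16)
    k = limit_kernel(x, x_prime)
    for width in (10, 100, 1_000, 10_000):
        # Average the squared kernel error over independent weight draws.
        errors = [(empirical_kernel(x, x_prime, width, rng) - k) ** 2 for _ in range(50)]
        print(f"width={width:>6}  mean squared error={np.mean(errors):.5f}")
```

For a fixed pair of inputs, the empirical kernel is an average of `width` i.i.d. terms, so its mean squared error should shrink roughly in proportion to `1 / width`. The precise bound and its constant are those of the theorem, which this sketch does not reproduce.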

Figures (10)

Figure 1: A deep rainbow network cascades random feature maps whose weight distributions are learned. They typically have a low-rank covariance. Each layer can be factorized into a linear dimensionality reduction determined by the "colored" (i.e., non-identity) covariance, followed by a non-linear high-dimensional embedding with "white" random features. At each layer, the hidden activations define a kernel which converges to a deterministic rainbow kernel in the infinite-width limit. The activations are however randomly rotated, which induces a similar rotation of the next layer weights.
Figure 2: Convergence of spectra of activations $\hat{\phi}_j$ of finite-width trained scattering networks towards the feature vector $\phi_j$. The figure shows the covariance spectra of activations $\hat{\phi}_j$ for a given layer $j = 4$ and various width scalings $s$ (left) and of the feature vector $\phi_j$ for the seven hidden layers $j \in \{ 1, \dots, 7 \}$ (right). The covariance spectrum is a power law of index close to $-1$.
Figure 3: Convergence of activations $\hat{\phi}_j$ of finite-width networks towards the corresponding feature vector $\phi_j$, for scattering networks trained on CIFAR-10 (left) and ResNet trained on ImageNet (right). Both panels show the relative mean squared error $\mathbb{E}_x[ \lVert \hat{A}_j \, \hat{\phi}_j(x) - \phi_j(x) \rVert^2 ] / \mathbb{E}_x[ \lVert \phi_j(x) \rVert^2 ]$ between aligned activations $\hat{A}_j \, \hat{\phi}_j$ and the feature vector $\phi_j$. The error decreases as a function of the width scaling $s$ for all layers for the scattering network, and all but the last few layers for ResNet.
Figure 4: The weight covariance estimate $\tilde{C}_j$ converges towards the infinite-dimensional covariance $C_j$ for a three-hidden-layer scattering network trained on CIFAR-10. The first three panels show the behavior of the layer $j=2$. Upper left: spectra of empirical weight covariances $\tilde{C}_j$ as a function of the network sample size $N$ showing the transition from an exponential decay (fitted by the dashed line for $N=1$) to the Marchenko-Pastur spectrum (fitted by the dotted lines). Lower left: test classification performance on CIFAR-10 of the trained networks as a function of the maximum rank of their weight covariance $\tilde{C}_j$. Most of the performance is captured with the first eigenvectors of $\tilde{C}_j$. The curves for different network sample sizes $N$ when estimating $\tilde{C}_j$ overlap and are offset for visual purposes. Upper right: spectrum of empirical weight covariances $\tilde{C}_j$ as a function of the network width scaling $s$. The dashed line is a fit to an exponential decay at low rank. Lower right: relative distance between empirical and true covariances $\lVert \hat{C}_j - C_j \rVert_\infty / \lVert C_j \rVert_\infty$, as a function of the width scaling $s$.
Figure 5: Covariance spectra of activations and weights of a ten-hidden-layer scattering network (top) and ResNet-18 (bottom) trained on ImageNet. In both cases, activation spectra (left) mainly follow a power-law distribution with index roughly $-1$. Weight spectra (right) show a transition from an exponential decay with a characteristic scale increasing with depth to the Marchenko-Pastur spectral distribution. These behaviors are captured by the rainbow model. For visual purposes, activation and weight spectra are offset by a factor depending on $j$. In addition, we do not show the first layer nor the $1\times1$ convolutional residual branches in ResNet as they have different layer properties.
...and 5 more figures
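
The factorization described in the Figure 1 caption, a linear dimensionality reduction set by the "colored" covariance followed by "white" random features, can be sketched for one layer as follows. This is a minimal illustration under stated assumptions, not the authors' code: the weights are taken to be Gaussian, the covariance is a synthetic low-rank matrix with an illustrative exponential eigenvalue decay, the non-linearity is ReLU, and the helper names (`low_rank_covariance`, `sample_colored_weights`, `layer`) are hypothetical.

```python
import numpy as np


def low_rank_covariance(d0, rank, rng):
    """Synthetic 'colored' covariance C = V diag(lam) V^T of the given rank."""
    V, _ = np.linalg.qr(rng.standard_normal((d0, rank)))  # orthonormal columns
    lam = np.exp(-np.arange(rank))  # illustrative eigenvalue decay (assumption)
    return V, lam


def sample_colored_weights(V, lam, width, rng):
    """Sample rows w ~ N(0, C) in factorized form W = G P."""
    P = np.sqrt(lam)[:, None] * V.T  # (rank, d0) reduction, with P^T P = C
    G = rng.standard_normal((width, V.shape[1]))  # white Gaussian weights
    return G, P


def layer(x, G, P):
    """White ReLU random features applied to the reduced input P x."""
    return np.maximum(G @ (P @ x), 0.0) / np.sqrt(G.shape[0])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    V, lam = low_rank_covariance(d0=32, rank=4, rng=rng)
    G, P = sample_colored_weights(V, lam, width=2_000, rng=rng)
    W = G @ P  # full colored weight matrix, of rank at most 4
    x = rng.standard_normal(32)
    # Applying W directly or reducing first gives the same features.
    direct = np.maximum(W @ x, 0.0) / np.sqrt(W.shape[0])
    print(np.allclose(direct, layer(x, G, P)))
```

Because `W = G @ P`, the layer only ever sees the input through the `rank`-dimensional projection `P @ x`, which is one way to read the caption's statement that a low-rank covariance acts as a dimensionality reduction before the high-dimensional embedding.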

Theorems & Definitions (9)

Theorem 1
Definition 1
Definition 2
Theorem 2
Theorem 3
Lemma 1
Lemma 2
Lemma 3
Lemma 4
