On the Impacts of the Random Initialization in the Neural Tangent Kernel Theory

Guhan Chen; Yicheng Li; Qian Lin

TL;DR

The training dynamics of the gradient flow of neural networks with random initialization are shown to converge uniformly to those of the corresponding NTK regression with random initialization, which implies that NTK theory may not fully explain the superior performance of neural networks.

Abstract

This paper discusses the impact of the random initialization of neural networks in the neural tangent kernel (NTK) theory, which most recent works on NTK theory ignore. It is well known that as the network's width tends to infinity, the neural network with random initialization converges to a Gaussian process $f^{\mathrm{GP}}$, which takes values in $L^{2}(\mathcal{X})$, where $\mathcal{X}$ is the domain of the data. In contrast, to adopt the traditional theory of kernel regression, most recent works introduced a special mirrored architecture and a mirrored (random) initialization to ensure the network's output is identically zero at initialization. It therefore remains an open question whether the conventional setting and the mirrored initialization lead wide neural networks to exhibit different generalization capabilities. In this paper, we first show that the training dynamics of the gradient flow of neural networks with random initialization converge uniformly to those of the corresponding NTK regression with random initialization $f^{\mathrm{GP}}$. We then show that $\mathbf{P}(f^{\mathrm{GP}} \in [\mathcal{H}^{\mathrm{NT}}]^{s}) = 1$ for any $s < \frac{3}{d+1}$ and $\mathbf{P}(f^{\mathrm{GP}} \in [\mathcal{H}^{\mathrm{NT}}]^{s}) = 0$ for any $s \geq \frac{3}{d+1}$, where $[\mathcal{H}^{\mathrm{NT}}]^{s}$ is the real interpolation space of the RKHS $\mathcal{H}^{\mathrm{NT}}$ associated with the NTK. Consequently, the generalization error of the wide neural network trained by gradient descent is $\Omega(n^{-\frac{3}{d+3}})$, and it still suffers from the curse of dimensionality. On one hand, the result highlights the benefits of mirrored initialization. On the other hand, it implies that NTK theory may not fully explain the superior performance of neural networks.
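The contrast between the two initializations described in the abstract can be illustrated with a minimal sketch. It assumes a two-layer ReLU network with a $\sqrt{2/m}$ output scaling and standard normal weights; the function names (`half_net`, `mirrored_net`, `init_mirrored`) and these architectural choices are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np

def half_net(x, W, a):
    # Two-layer ReLU network with an assumed sqrt(2/m) output scaling.
    m = W.shape[0]
    return np.maximum(x @ W.T, 0.0) @ a * np.sqrt(2.0 / m)

def init_mirrored(d, m, rng):
    # Mirrored initialization (sketch): draw one random half,
    # then copy it so both halves start with identical parameters.
    W = rng.standard_normal((m, d))
    a = rng.standard_normal(m)
    return {"W1": W.copy(), "a1": a.copy(), "W2": W.copy(), "a2": a.copy()}

def mirrored_net(x, p):
    # The output is the scaled difference of the two halves, so it is
    # identically zero while the halves share the same parameters.
    diff = half_net(x, p["W1"], p["a1"]) - half_net(x, p["W2"], p["a2"])
    return diff / np.sqrt(2.0)

rng = np.random.default_rng(0)
d, m = 3, 512
x = rng.standard_normal((10, d))

params = init_mirrored(d, m, rng)
out_mirrored = mirrored_net(x, params)                      # zero at init
out_standard = half_net(x, params["W1"], params["a1"])      # random at init
```

In this sketch the standard network's initial output is a nonzero random function of the input, which is the quantity that converges to $f^{\mathrm{GP}}$ as the width grows, while the mirrored network starts from exactly zero.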

Paper Structure (51 sections, 29 theorems, 128 equations, 4 figures, 1 table)

This paper contains 51 sections, 29 theorems, 128 equations, 4 figures, 1 table.

Introduction
Our contribution
Related works
Preliminaries
Model and notations
Notations
Reproducing kernel Hilbert space
Real interpolation space
Kernel gradient flow
Network and Neural Tangent Kernel
Network settings
Standard initialization
Network at initialization
Gaussian process
The kernel regime
...and 36 more sections

Key Result

Proposition 2.2

Suppose the eigenvalue decay rate of $k$ is $\beta$ and the embedding index is $\frac{1}{\beta}$ with respect to $\mu$. Suppose the noise term $\epsilon$ satisfies the noise assumption (Assumption assu:noise). Let the dynamics in eq. (KGD_equation) start from $f_0^{\mathrm{GF}} = 0$. Also, suppose the regression function satisfies … where $C$ is a positive constant.
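To make the role of the starting function concrete, here is a minimal sketch of kernel gradient flow in closed form. It assumes the squared loss is normalized as $\frac{1}{2n}\sum_i (f(x_i) - y_i)^2$, so that $f_t(x) = f_0(x) + k(x, X) K^{-1}(I - e^{-tK/n})(y - f_0(X))$. The Gaussian RBF kernel stands in for the NTK, and the data, the nonzero start `f0`, and all function names are hypothetical.

```python
import numpy as np

def rbf_kernel(A, B, gamma=10.0):
    # Stand-in kernel (assumption): Gaussian RBF, not the NTK from the paper.
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq)

def kgf_predict(kernel, X, y, x_new, t, f0=None):
    # Closed-form kernel gradient flow at time t, started from f0
    # (f0=None means the zero start f_0 = 0).
    n = X.shape[0]
    f0_train = np.zeros(n) if f0 is None else f0(X)
    f0_new = np.zeros(x_new.shape[0]) if f0 is None else f0(x_new)
    lam, U = np.linalg.eigh(kernel(X, X))
    lam = np.clip(lam, 0.0, None)
    # phi(lam) = (1 - exp(-t*lam/n)) / lam, with the limit t/n as lam -> 0.
    safe = np.where(lam > 1e-12, lam, 1.0)
    phi = np.where(lam > 1e-12, -np.expm1(-t * lam / n) / safe, t / n)
    coef = U @ (phi * (U.T @ (y - f0_train)))
    return f0_new + kernel(x_new, X) @ coef

# Synthetic 1-D data, for illustration only.
X = np.linspace(-1.0, 1.0, 5).reshape(-1, 1)
y = np.sin(3.0 * X[:, 0])
f0 = lambda Z: 0.5 * np.cos(2.0 * Z[:, 0])  # hypothetical nonzero start

pred_zero = kgf_predict(rbf_kernel, X, y, X, t=1e6)         # zero start
pred_init = kgf_predict(rbf_kernel, X, y, X, t=1e6, f0=f0)  # nonzero start
pred_t0 = kgf_predict(rbf_kernel, X, y, X, t=0.0, f0=f0)    # equals f0 at t=0
```

On the training points both starts fit the labels after long training in this sketch. Away from the training points the predictions generally differ, because only the residual $y - f_0(X)$ is corrected by the flow.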

Figures (4)

Figure 1: Generalization error decay curve of the network. The scatter points show the averaged log error over $20$ trials. The dashed lines are least-squares fits. The range of $n$ is limited because a larger $n$ requires a larger $m$, which would incur higher computational costs.
Figure 2: Decay curve of the logarithm of the sum of squared coefficients for MNIST.
Figure 3: Decay curve of the logarithm of the sum of squared coefficients for Fashion-MNIST.
Figure 4: Decay curve of the logarithm of sum of squared coefficients for CIFAR-10.
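The dashed least-squares lines mentioned in the caption of Figure 1 can be reproduced in spirit by fitting a straight line to log error against log $n$; the slope then estimates the decay exponent. The sketch below uses synthetic errors generated at the rate $n^{-\frac{3}{d+3}}$ with an assumed $d = 3$ and an arbitrary constant, so the numbers are illustrative and are not the paper's measurements.

```python
import numpy as np

def fit_decay_rate(n_values, errors):
    # Least-squares line through (log n, log error); the slope is the
    # estimated exponent r in error ~ C * n^r.
    slope, intercept = np.polyfit(np.log(n_values), np.log(errors), deg=1)
    return slope, intercept

# Synthetic example (assumption): d = 3 gives exponent -3/(d+3) = -0.5.
d = 3
n_values = np.array([100.0, 200.0, 400.0, 800.0, 1600.0])
errors = 2.0 * n_values ** (-3.0 / (d + 3))

slope, intercept = fit_decay_rate(n_values, errors)
```

With noiseless synthetic data the fitted slope recovers the generating exponent; with real averaged errors the slope would only approximate the rate over the sampled range of $n$.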

Theorems & Definitions (45)

Definition 2.1: Relative smoothness
Proposition 2.2
Remark 3.1: Mirrored initialization
Lemma 3.2: Limit distribution of initialization
Proposition 3.3: Uniform convergence
Proposition 4.1: Impact of initialization in kernel gradient flow
Theorem 4.2: Smoothness of Gaussian Process
Theorem 4.3: Generalization error upper bound
Theorem 4.4: Generalization error lower bound
Proposition B.1
...and 35 more
