Understanding the role of depth in the neural tangent kernel for overparameterized neural networks

William St-Arnaud, Margarida Carvalho, Golnoosh Farnadi

TL;DR

The paper analyzes how increasing depth affects the neural tangent kernel (NTK) of overparameterized, infinitely wide ReLU networks trained with gradient descent. For data supported on the sphere, it establishes a convergence result in depth for the limiting kernel $\Theta_{\infty}^{(L)}$ and derives the corresponding limiting predictor via a rough differential equation. A key finding is that the normalized kernel $\bar{\Theta}_{\infty}^{(L)}$ converges to the matrix of ones as depth grows, while the corresponding regression limit remains well-defined under a fast-to-slow depth-to-width regime. The paper also discusses convergence rates, empirical verification, and extensions to other kernels. These results clarify the role of depth in kernel-based generalization in the infinite-width regime and suggest how depth-to-width tradeoffs influence the determinism of NTK-based predictions in practice.

Abstract

Overparameterized fully-connected neural networks have been shown to behave like kernel models when trained with gradient descent, under mild conditions on the width, the learning rate, and the parameter initialization. In the limit of infinitely large widths and small learning rate, the resulting kernel allows the output of the learned model to be represented in closed form. This closed-form solution hinges on the invertibility of the limiting kernel, a property that often holds on real-world datasets. In this work, we analyze the sensitivity of large ReLU networks to increasing depth by characterizing the corresponding limiting kernel. Our theoretical results demonstrate that the normalized limiting kernel approaches the matrix of ones. In contrast, they show that the corresponding closed-form solution approaches a fixed limit on the sphere. We empirically evaluate the order of magnitude in network depth required to observe this convergent behavior, and we describe the essential properties that enable the generalization of our results to other kernels.

Paper Structure

This paper contains 13 sections, 14 theorems, 46 equations, 1 figure, 1 table.

Key Result

Theorem 1

Suppose we have a fully-connected neural network of depth $L$ with non-linear activation. In the limit as the layer widths $n_1, \dots, n_{L-1} \to \infty$, the neural tangent kernel $\Theta^{(L)}$ (see Definition 2) converges in probability to a deterministic limiting kernel determined by $\Theta_{\infty}^{(L)}$, where $\Theta_{\infty}^{(l)}$ is defined recursively over the layers $l$.
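
For reference, the limit and the recursion in Jacot et al. (2018), the source this theorem is attributed to, can be written as follows. The symbols $\dot{\Sigma}^{(l)}$ (covariance of the activation derivatives), $\sigma$ (the activation), and $\mathrm{Id}_{n_L}$ (identity on the output dimension) follow that paper and are assumed, not confirmed, to match the notation used here.

```latex
\[
  \Theta^{(L)} \;\longrightarrow\; \Theta_{\infty}^{(L)} \otimes \mathrm{Id}_{n_L},
\]
\[
  \Theta_{\infty}^{(1)}(x, x') = \Sigma^{(1)}(x, x'), \qquad
  \Theta_{\infty}^{(l+1)}(x, x') = \Theta_{\infty}^{(l)}(x, x')\,\dot{\Sigma}^{(l+1)}(x, x') + \Sigma^{(l+1)}(x, x'),
\]
\[
  \dot{\Sigma}^{(l+1)}(x, x') = \mathbb{E}_{f \sim \mathcal{N}(0,\, \Sigma^{(l)})}\!\left[\dot{\sigma}(f(x))\,\dot{\sigma}(f(x'))\right].
\]
```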

Figures (1)

  • Figure 1: Convergence rate of $\kappa$ on $X$ and point $x$.

Theorems & Definitions (33)

  • Definition 1: (mean) Covariance of neurons $\Sigma^{(l)}$
  • Definition 2: Neural tangent kernel (NTK)
  • Theorem 1 (Jacot et al., 2018)
  • Proposition 1
  • Proof sketch
  • Definition 3: Correlation coefficient of $\Sigma^{(L)}(x, x')$
  • Proposition 2 (Arora et al., 2019)
  • Proposition 3 (Jacot et al., 2018)
  • Lemma 1: Convergence of $\rho^{(L)}$
  • Definition 4: Normalization of the $\Theta_{\infty}^{(L)}$ kernel
  • ...and 23 more
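
Definition 3 and Lemma 1 in the list above concern the correlation coefficient $\rho^{(L)}$ of the covariance $\Sigma^{(L)}(x, x')$ and its convergence. The sketch below iterates a correlation map numerically to give a feel for how depth acts on it. It assumes ReLU activations, no bias term, and the standard degree-1 arc-cosine correlation map; none of these choices is confirmed by the text here, and the function names are hypothetical.

```python
import numpy as np


def relu_correlation_step(rho):
    """Apply one layer of the (assumed) ReLU correlation map.

    This is the normalized degree-1 arc-cosine kernel:
    rho -> (sin(theta) + (pi - theta) * cos(theta)) / pi, with theta = arccos(rho).
    """
    rho = np.clip(rho, -1.0, 1.0)  # guard against rounding outside [-1, 1]
    theta = np.arccos(rho)
    return (np.sin(theta) + (np.pi - theta) * rho) / np.pi


def correlation_after_depth(rho0, depth):
    """Iterate the correlation map `depth` times from the input correlation rho0."""
    rho = np.asarray(rho0, dtype=float)
    for _ in range(depth):
        rho = relu_correlation_step(rho)
    return rho


if __name__ == "__main__":
    # Input correlations x^T x' for unit-norm inputs (illustrative values only).
    rho0 = np.array([-0.9, 0.0, 0.5])
    for depth in (1, 10, 100, 1000):
        print(depth, correlation_after_depth(rho0, depth))
```

Under these assumptions the iterates increase toward 1 for any starting correlation, although slowly. The sketch illustrates the covariance correlation only, not the normalized kernel $\bar{\Theta}_{\infty}^{(L)}$, whose normalization (Definition 4) is not reproduced here.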