Finite Depth and Width Corrections to the Neural Tangent Kernel

Boris Hanin, Mihai Nica

TL;DR

The paper analyzes finite-depth and finite-width corrections to the neural tangent kernel (NTK) for deep ReLU networks, showing that the NTK remains stochastic when depth grows proportionally to width, with fluctuations scaling as $e^{5\beta}$ where $\beta = \frac{d}{\text{width}}$. It derives precise moment formulas for the diagonal NTK and its SGD update, decomposing contributions into weight, bias, and cross terms via a detailed path-sum representation on the network's computation graph. The main contributions are the formal mean-variance characterizations of ${\mathbb E}[K_{\mathcal{N}}(x,x)]$, ${\mathbb E}[K_{\mathcal{N}}(x,x)^2]$, and ${\mathbb E}[\Delta K_{\mathcal{N}}(x,x)]$, plus analogous bias and mixed-moment results, all valid for networks with varying layer widths. These results imply that deep and wide ReLU networks can exhibit data-dependent feature learning even in regimes where the infinite-width NTK is typically considered to be frozen, highlighting potential weak feature-learning regimes and guiding future exploration of non-ReLU architectures and broader network families.

Abstract

We prove the precise scaling, at finite depth and width, for the mean and variance of the neural tangent kernel (NTK) in a randomly initialized ReLU network. The standard deviation is exponential in the ratio of network depth to width. Thus, even in the limit of infinite overparameterization, the NTK is not deterministic if depth and width simultaneously tend to infinity. Moreover, we prove that for such deep and wide networks, the NTK has a non-trivial evolution during training by showing that the mean of its first SGD update is also exponential in the ratio of network depth to width. This is in sharp contrast to the regime where depth is fixed and network width is very large. Our results suggest that, unlike relatively shallow and wide networks, deep and wide ReLU networks are capable of learning data-dependent features even in the so-called lazy training regime.
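
The claim that the NTK stays random when depth is comparable to width can be probed numerically. The following is a minimal Monte Carlo sketch, not the paper's setup: it assumes a bias-free ReLU network with a scalar output, Gaussian weights of variance $2/\text{fan-in}$, and the unscaled kernel $K(x,x)=\sum_{w}(\partial \mathcal{N}(x)/\partial w)^2$ summed over weights only. The function names (`init_weights`, `ntk_diagonal`, `normalized_second_moment`) are hypothetical. It estimates the ratio ${\mathbb E}[K^2]/{\mathbb E}[K]^2$, which equals 1 for a deterministic kernel.

```python
import numpy as np


def init_weights(widths, rng):
    """Gaussian weights with variance 2 / fan_in (an assumed initialization)."""
    return [rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_out, n_in))
            for n_in, n_out in zip(widths[:-1], widths[1:])]


def forward(weights, x):
    """Return the scalar output plus cached layer inputs and pre-activations."""
    acts, pres = [x], []
    h = x
    for i, W in enumerate(weights):
        z = W @ h
        pres.append(z)
        # ReLU on hidden layers, identity on the final (output) layer.
        h = np.maximum(z, 0.0) if i < len(weights) - 1 else z
        acts.append(h)
    return h[0], acts, pres


def ntk_diagonal(weights, x):
    """K(x, x): sum of squared derivatives of the output w.r.t. every weight."""
    _, acts, pres = forward(weights, x)
    delta = np.ones(1)  # derivative of the output w.r.t. the last pre-activation
    total = 0.0
    for i in reversed(range(len(weights))):
        # The gradient w.r.t. weights[i] is outer(delta, acts[i]); its squared
        # Frobenius norm factorizes as |delta|^2 * |acts[i]|^2.
        total += np.sum(delta ** 2) * np.sum(acts[i] ** 2)
        if i > 0:
            # Backpropagate through the ReLU of the previous hidden layer.
            delta = (weights[i].T @ delta) * (pres[i - 1] > 0)
    return total


def normalized_second_moment(n0, width, depth, trials, seed=0):
    """Monte Carlo estimate of E[K^2] / E[K]^2 over random initializations."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n0)
    x *= np.sqrt(n0) / np.linalg.norm(x)  # fix |x|^2 = n0
    widths = [n0] + [width] * depth + [1]
    ks = np.array([ntk_diagonal(init_weights(widths, rng), x)
                   for _ in range(trials)])
    return np.mean(ks ** 2) / np.mean(ks) ** 2


if __name__ == "__main__":
    # Hold the width fixed and increase the number of hidden layers.
    for depth in (4, 16, 64):
        ratio = normalized_second_moment(n0=16, width=32, depth=depth, trials=500)
        print(f"depth={depth:3d}  depth/width={depth / 32:.3f}  ratio={ratio:.3f}")
```

Under these assumptions, one would expect the printed ratio to stay near 1 for small depth-to-width ratios and to grow as the ratio increases. With few trials the estimate is noisy, because the distribution of $K$ becomes heavy-tailed in the deep regime.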

Paper Structure

This paper contains 15 sections, 14 theorems, 176 equations, 2 figures.

Key Result

Theorem 1

We have […]. Moreover, we have that ${\mathbb E}\left[K_{\mathcal{N}}(x,x)^2\right]$ is bounded above and below by universal constants times […] times a multiplicative error $\left(1+O\left(\sum_{i=1}^d \frac{1}{n_i^2} \right) \right)$. Here $f \simeq g$ means that $f$ is bounded above and below by universal constants times $g$. In particular, if all the hidden layer widths are equal (i.e. $n_i=n$, for $i=1$ …

Figures (2)

  • Figure 1: Cartoon of the four paths $\gamma_1,\gamma_2,\gamma_3,\gamma_4$ between layers $\ell_1$ and $\ell_2$ in the case where there is no interaction. Paths stay with their original partners ($\gamma_1$ with $\gamma_2$, and $\gamma_3$ with $\gamma_4$) at all intermediate layers.
  • Figure 2: Cartoon of the four paths $\gamma_1,\gamma_2,\gamma_3,\gamma_4$ between layers $\ell_1$ and $\ell_2$ in the case where there is exactly one "loop" interaction between the marked layers. Paths swap away from their original partners exactly once at some intermediate layer after $\ell_1$, and then swap back to their original partners before $\ell_2$.

Theorems & Definitions (19)

  • Theorem 1: Mean and Variance of NTK on Diagonal at Init
  • Theorem 2: Mean of Time Derivative of NTK on Diagonal at Init
  • Definition 1: Path in the computational graph of $\mathcal{N}$
  • Definition 2: Weight of a path in the computational graph of $\mathcal{N}$
  • Definition 3: Unordered multisets of edges and their endpoints
  • Proposition 3: Pure weight moments for $K_{\mathcal{N}}, \Delta K_{\mathcal{N}}$
  • Proposition 4: Pure bias moments for $K_{\mathcal{N}}, \Delta K_{\mathcal{N}}$
  • Proposition 5: Mixed bias-weight moments for $K_{\mathcal{N}}, \Delta K_{\mathcal{N}}$
  • Lemma 6: Weight contribution to $K_{\mathcal{N}}$ and $\Delta K_{\mathcal{N}}$ as a sum-over-paths
  • Lemma 7: Expectation of $K_{\mathrm{w}},K_{\mathrm{w}}^2,\Delta_{\mathrm{ww}}$ as sums over $2,4$ paths
  • ...and 9 more
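
The path-based definitions and the sum-over-paths lemmas listed above rest on the fact that a ReLU network's output can be written as a sum over paths in its computation graph. The sketch below checks this identity by brute force on a tiny network. It is an illustration under simplifying assumptions (no biases, scalar output) and may differ from the paper's exact definitions; the names `relu_net` and `path_sum` are hypothetical.

```python
import itertools

import numpy as np


def relu_net(weights, x):
    """Bias-free ReLU network; returns the scalar output and hidden pre-activations."""
    h, pres = x, []
    for W in weights[:-1]:
        z = W @ h
        pres.append(z)
        h = np.maximum(z, 0.0)
    return (weights[-1] @ h)[0], pres


def path_sum(weights, x):
    """Sum over input-to-output paths of x[start] * (product of weights) * 1{path open}."""
    _, pres = relu_net(weights, x)
    widths = [weights[0].shape[1]] + [W.shape[0] for W in weights]
    total = 0.0
    # A path picks one neuron index per layer, from the input to the single output.
    for path in itertools.product(*(range(n) for n in widths)):
        # A path is "open" if every hidden neuron it visits has positive pre-activation.
        if not all(pres[l][path[l + 1]] > 0 for l in range(len(pres))):
            continue
        weight = x[path[0]]
        for l, W in enumerate(weights):
            weight *= W[path[l + 1], path[l]]
        total += weight
    return total


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    widths = [3, 4, 3, 1]  # input dimension, two hidden layers, scalar output
    weights = [rng.normal(size=(n_out, n_in))
               for n_in, n_out in zip(widths[:-1], widths[1:])]
    x = rng.normal(size=widths[0])
    out, _ = relu_net(weights, x)
    print(np.isclose(out, path_sum(weights, x)))
```

The enumeration grows as the product of the layer widths, so it is only practical for very small networks. Its purpose is to make the representation concrete, since moments of the kernel then become sums over pairs or quadruples of such paths.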