Finite-Width Neural Tangent Kernels from Feynman Diagrams

Max Guillen, Philipp Misof, Jan E. Gerken

TL;DR

Feynman diagrams are introduced for computing finite-width corrections to NTK statistics. The feasibility of this framework is demonstrated by extending stability results for deep networks from preactivations to NTKs and by proving that, for scale-invariant nonlinearities such as ReLU, finite-width corrections are absent on the diagonal of the Gram matrix of the NTK.

Abstract

Neural tangent kernels (NTKs) are a powerful tool for analyzing deep, non-linear neural networks. In the infinite-width limit, NTKs can easily be computed for most common architectures, yielding full analytic control over the training dynamics. However, at infinite width, important properties of training such as NTK evolution or feature learning are absent. Nevertheless, finite-width effects can be included by computing corrections to the Gaussian statistics at infinite width. We introduce Feynman diagrams for computing finite-width corrections to NTK statistics. These dramatically simplify the necessary algebraic manipulations and enable the computation of layer-wise recursion relations for arbitrary statistics involving preactivations, NTKs and certain higher-derivative tensors (dNTK and ddNTK) required to predict the training dynamics at leading order. We demonstrate the feasibility of our framework by extending stability results for deep networks from preactivations to NTKs and proving the absence of finite-width corrections for scale-invariant nonlinearities such as ReLU on the diagonal of the Gram matrix of the NTK. We numerically implement the complete set of equations necessary to compute the first-order corrections for arbitrary inputs and demonstrate that the results follow the statistics of sampled neural networks for widths $n\gtrsim 20$.

Paper Structure

This paper contains 14 sections, 5 theorems, 7 equations, 3 figures.

Key Result

Theorem 4.1

The Feynman rules postulated in items $(i)$-$(v)$, in conjunction with the selection rules given in the appendix on Feynman rules, uniquely determine the recursion relations governing the layer evolution of the NTK tensors $D$, $F$, $A$, $B$ at order $\frac{1}{n}$.

Figures (3)

  • Figure 1: Finite-width corrected kernels. The Monte Carlo (MC) estimates of the off-diagonal NNGP entry $\overline{K}_{01}$ and NTK entry $\overline{\Theta}_{01}$ (red) at the fourth layer of a GeLU MLP are shown at different hidden-layer widths $n=n_\ell$ and compared to the first-order corrected finite-width solutions $K^{(\ell)}_{01} + K^{\{1\}(\ell)}_{01} / n_\ell$ and $\Theta^{(\ell)}_{01} + \Theta^{\{1\}(\ell)}_{01} / n_\ell$ (blue), respectively, as well as to the infinite-width results (gray). Sample sizes for the MC estimates of the NNGP and NTK are $10^6$ and $10^5$, respectively. Error bars are included, but mostly covered by the mean line. For details, see the experiments section.
  • Figure 2: Gradient stability. Components of the Monte Carlo estimated NTK $\overline{\Theta}_{\alpha \beta}$ of a ReLU MLP, corresponding to single and distinct inputs, are shown as a function of layer depth $\ell$ for three different choices of $C_W^{(\ell)}$. The hidden layers are of size 200. The case of the critical value $C_W^{(\ell)} = C_W^\mathrm{c}=2$ is shown in the middle. For the single-input case at criticality (red line in the middle), we also show the expected linear relation [roberts2022] (purple). Sample means are obtained from 1000 initializations. Error bands are standard errors of the mean but are mostly too small to be visible.
  • Figure 3: Finite-width corrections for ReLU. Relative deviations of the Monte Carlo estimated NTK from its infinite-width counterpart as a function of hidden-layer width $n=n_\ell$. A four-layer ReLU MLP with $C_W=2$ is sampled over $5\times 10^6$ initializations. Error bars of the sample mean are included for both the single and distinct input components.
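
The Monte Carlo comparisons described in the captions above can be illustrated on a simpler quantity than the NTK. The following is a minimal sketch, not the authors' implementation: it estimates the diagonal of the preactivation (NNGP) kernel of a bias-free ReLU MLP by sampling random initializations. It assumes weights drawn as $W_{ij} \sim \mathcal{N}(0, C_W/n_{\mathrm{in}})$, which may differ from the paper's conventions, and the function names are hypothetical. Under these assumptions the expected diagonal entry is $C_W \lVert x \rVert^2 / n_0 \cdot (C_W/2)^{L-1}$ at any width, so it stays constant in depth at $C_W = 2$.

```python
import random


def relu(z):
    return z if z > 0.0 else 0.0


def sample_preactivations(x, width, depth, c_w, rng):
    """Draw one random bias-free ReLU MLP and return its last-layer preactivations.

    Assumption: every weight is N(0, c_w / fan_in); there are no biases.
    """
    std = (c_w / len(x)) ** 0.5
    z = [sum(rng.gauss(0.0, std) * xj for xj in x) for _ in range(width)]
    for _ in range(depth - 1):
        a = [relu(zj) for zj in z]
        std = (c_w / width) ** 0.5
        z = [sum(rng.gauss(0.0, std) * aj for aj in a) for _ in range(width)]
    return z


def mc_diagonal_kernel(x, width, depth, c_w, n_samples, seed=0):
    """Monte Carlo estimate of E[(1/width) * sum_i z_i(x)^2] at the last layer."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        z = sample_preactivations(x, width, depth, c_w, rng)
        total += sum(zi * zi for zi in z) / width
    return total / n_samples


def expected_diagonal_kernel(x, depth, c_w):
    """Closed form under the assumptions above: each ReLU layer multiplies by c_w / 2."""
    k_first = c_w * sum(xj * xj for xj in x) / len(x)
    return k_first * (c_w / 2.0) ** (depth - 1)


if __name__ == "__main__":
    x = [1.0, -0.5, 2.0]
    for c_w in (1.0, 2.0):
        est = mc_diagonal_kernel(x, width=16, depth=3, c_w=c_w, n_samples=4000)
        print(c_w, est, expected_diagonal_kernel(x, depth=3, c_w=c_w))
```

Running the script prints, for each choice of $C_W$, the sampled estimate next to the closed-form value; the two should agree up to sampling noise even at the small width used here. The NTK statistics studied in the paper require gradients with respect to all parameters and are not covered by this sketch.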

Theorems & Definitions (10)

  • Theorem 4.1
  • proof
  • Theorem 4.2
  • proof
  • Theorem 4.3
  • proof
  • Theorem 5.1
  • proof
  • Theorem 5.2
  • proof