Dynamics of Deep Neural Networks and Neural Tangent Hierarchy

Jiaoyang Huang, Horng-Tzer Yau

TL;DR

This work studies the finite-width dynamics of deep neural networks trained by gradient flow by introducing the Neural Tangent Hierarchy (NTH), an infinite system of ODEs that governs the evolution of the NTK and its higher-order kernels during training. It establishes a priori estimates showing that the order-$r$ kernel is small (scaling like $m^{-(r/2-1)}$ in the width $m$), and proves that a hierarchy truncated at level $p$ approximates the full NTK dynamics to arbitrary precision on suitable time scales. The authors show that the NTK changes at a rate $O(1/m)$, which yields linear convergence guarantees under conditions on the width and improves on previous bounds. The truncated hierarchy enables direct analysis and prediction of NTK changes and helps explain why finite-width networks can outperform kernel regression with the limiting NTK, with potential extensions to other architectures and discrete-time training.
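
For orientation, the hierarchy described above can be written schematically as follows. This is a sketch that assumes the squared loss $\frac{1}{2n}\sum_{\beta}(f_\beta - y_\beta)^2$ over $n$ training points $(x_\beta, y_\beta)$, with $f_\beta(t)$ the network output at $x_\beta$ and $K_t^{(2)}$ the NTK; the exact constants and normalization should be checked against the paper's own statement. The last line shows one natural reading of truncation at level $p$, in which the top kernel is frozen at its initial value.

```latex
\begin{align*}
\partial_t \bigl(f_\alpha(t) - y_\alpha\bigr)
  &= -\frac{1}{n}\sum_{\beta=1}^{n} K_t^{(2)}(x_\alpha, x_\beta)\,\bigl(f_\beta(t) - y_\beta\bigr), \\
\partial_t K_t^{(r)}(x_{\alpha_1}, \dots, x_{\alpha_r})
  &= -\frac{1}{n}\sum_{\beta=1}^{n} K_t^{(r+1)}(x_{\alpha_1}, \dots, x_{\alpha_r}, x_\beta)\,\bigl(f_\beta(t) - y_\beta\bigr),
  \qquad r \geqslant 2, \\
\text{truncation at level } p:\quad
\partial_t \tilde K_t^{(p)} &= 0 .
\end{align*}
```

With $p = 2$ the kernel never moves and the system reduces to kernel gradient flow with a fixed NTK; each additional level adds a correction that the a priori estimates make smaller by a further factor of roughly $m^{-1/2}$.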

Abstract

The evolution of a deep neural network trained by gradient descent can be described by its neural tangent kernel (NTK), as introduced in [20], where it was proven that in the infinite-width limit the NTK converges to an explicit limiting kernel and stays constant during training. The NTK was also implicit in some other recent papers [6,13,14]. In the overparametrized regime, a fully trained deep neural network is indeed equivalent to the kernel regression predictor using the limiting NTK, and gradient descent achieves zero training loss for a deep overparametrized neural network. However, it was observed in [5] that there is a performance gap between kernel regression using the limiting NTK and deep neural networks. This performance gap likely originates from the change of the NTK along training due to the finite-width effect. The change of the NTK along training is central to describing the generalization features of deep neural networks. In the current paper, we study the dynamics of the NTK for finite-width deep fully-connected neural networks. We derive an infinite hierarchy of ordinary differential equations, the neural tangent hierarchy (NTH), which captures the gradient descent dynamics of the deep neural network. Moreover, under certain conditions on the neural network width and the data set dimension, we prove that the truncated hierarchy of NTH approximates the dynamics of the NTK up to arbitrary precision. This description makes it possible to directly study the change of the NTK for deep neural networks, and sheds light on the observation that deep neural networks outperform kernel regressions using the corresponding limiting NTK.
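
The finite-width effect mentioned in the abstract, namely that the NTK moves during training and moves less as the width grows, can be probed with a small numerical experiment. The sketch below is not the paper's setup: it assumes a two-layer network $f(x) = a^\top \tanh(Wx)/\sqrt{m}$ rather than a deep one, replaces gradient flow with small-step gradient descent, and uses random toy data. All function names (`make_data`, `empirical_ntk`, `ntk_relative_change`) are hypothetical and introduced only for illustration.

```python
import numpy as np


def make_data(n=8, d=5, seed=0):
    """Toy data (an assumption): unit-norm inputs and O(1) labels."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)
    y = rng.standard_normal(n)
    return X, y


def forward(W, a, X):
    """Network output a . tanh(W x) / sqrt(m) for every row x of X."""
    m = W.shape[0]
    return np.tanh(X @ W.T) @ a / np.sqrt(m)


def jacobian(W, a, X):
    """Gradients of f(x_i) w.r.t. a, shape (n, m), and W, shape (n, m, d)."""
    m = W.shape[0]
    act = np.tanh(X @ W.T)      # (n, m)
    dact = 1.0 - act ** 2       # derivative of tanh
    J_a = act / np.sqrt(m)
    J_W = (dact * a)[:, :, None] * X[:, None, :] / np.sqrt(m)
    return J_a, J_W


def empirical_ntk(W, a, X):
    """Empirical NTK: Gram matrix of parameter gradients, shape (n, n)."""
    J_a, J_W = jacobian(W, a, X)
    J = np.concatenate([J_a, J_W.reshape(X.shape[0], -1)], axis=1)
    return J @ J.T


def half_mse(W, a, X, y):
    """Loss (1 / 2n) * sum_i (f(x_i) - y_i)^2."""
    return 0.5 * np.mean((forward(W, a, X) - y) ** 2)


def ntk_relative_change(m, X, y, steps=200, lr=0.2, seed=1):
    """Train at width m; report the relative Frobenius change of the NTK."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = rng.standard_normal((m, d))
    a = rng.standard_normal(m)
    K0 = empirical_ntk(W, a, X)
    initial_loss = half_mse(W, a, X, y)
    for _ in range(steps):
        resid = forward(W, a, X) - y
        J_a, J_W = jacobian(W, a, X)  # both gradients use the old parameters
        a = a - lr * (J_a.T @ resid) / n
        W = W - lr * np.einsum("i,imd->md", resid, J_W) / n
    KT = empirical_ntk(W, a, X)
    return {
        "ntk_change": np.linalg.norm(KT - K0) / np.linalg.norm(K0),
        "initial_loss": initial_loss,
        "final_loss": half_mse(W, a, X, y),
    }


if __name__ == "__main__":
    X, y = make_data()
    for width in (64, 256, 1024, 4096):
        stats = ntk_relative_change(width, X, y)
        print(width, stats["ntk_change"], stats["final_loss"])
```

Running the loop at several widths and comparing the reported `ntk_change` values gives a rough picture of how the kernel drift shrinks with $m$. A toy run like this illustrates the trend only; it does not verify the $O(1/m)$ rate stated in the paper.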

Paper Structure

This paper contains 11 sections, 11 theorems, 187 equations.

Key Result

Theorem 2.3

Under Assumptions \ref{a:sigmaasup} and \ref{a:nonlinear}, there exists an infinite family of operators $K_t^{(r)}: {\mathcal{X}}^r\mapsto {\mathbb R}$ for $r\geqslant 2$ such that the continuous-time gradient descent dynamic is given by an infinite hierarchy of ordinary differential equations, i.e., the NTH. Moreover, for any $r\geqslant 2$, there exists a deterministic family (independent of $m$) of operators ${\frak K}$ ...

Theorems & Definitions (24)

  • Theorem 2.3
  • Corollary 2.4
  • Corollary 2.5
  • Theorem 2.6
  • Proposition A.1
  • Proof of Proposition \ref{p:fDstruc}
  • Proposition A.2
  • Lemma A.3
  • Proof of Proposition \ref{p:Krlimit}
  • Claim A.4
  • ...and 14 more