On the Inductive Bias of Neural Tangent Kernels

Alberto Bietti, Julien Mairal

TL;DR

The paper analyzes the inductive bias of gradient descent in the lazy, over-parameterized regime through neural tangent kernels (NTKs) for fully-connected and convolutional architectures. It derives a hierarchical, tree-structured NTK for CNNs with generic patch extraction and pooling, establishes Hölder smoothness for ReLU NTKs, and proves deformation stability bounds for CNN mappings. For two-layer ReLU networks, it provides a Mercer decomposition whose eigenvalue decay yields better approximation properties than the arc-cosine kernels obtained by training only the last layer. Numerical experiments on digit images compare the deformation stability of NTKs and convolutional kernel networks (CKNs), highlighting a tradeoff between approximation power and stability and suggesting that regimes beyond pure lazy training could be advantageous. Overall, the work clarifies how the kernel and its function space shape what deep networks learn and how they generalize under lazy training, with implications for architecture design and regularization.

Abstract

State-of-the-art neural networks are heavily over-parameterized, making the optimization algorithm a crucial ingredient for learning predictive models with good generalization properties. A recent line of work has shown that in a certain over-parameterized regime, the learning dynamics of gradient descent are governed by a certain kernel obtained at initialization, called the neural tangent kernel. We study the inductive bias of learning in such a regime by analyzing this kernel and the corresponding function space (RKHS). In particular, we study smoothness, approximation, and stability properties of functions with finite norm, including stability to image deformations in the case of convolutional networks, and compare to other known kernels for similar architectures.

Paper Structure

This paper contains 39 sections, 17 theorems, 96 equations, 2 figures.

Key Result

Lemma 1

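The NTK for the fully-connected network can be defined as $K(x, x') = \langle \Phi_n(x), \Phi_n(x') \rangle$, with $\Phi_0(x) = \Psi_0(x) = x$ and, for $k \geq 1$,

$$\Psi_k(x) = \varphi_1(\Psi_{k-1}(x)), \qquad \Phi_k(x) = \begin{pmatrix} \varphi_0(\Psi_{k-1}(x)) \otimes \Phi_{k-1}(x) \\ \varphi_1(\Psi_{k-1}(x)) \end{pmatrix},$$

where $\otimes$ is the tensor product and $\varphi_0$, $\varphi_1$ are the kernel mappings of the arc-cosine kernels of degree 0 and 1.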
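In kernel form, a recursion of this kind is usually evaluated on the inner product $u = \langle x, x' \rangle$ of unit-norm inputs. The following is a minimal sketch under stated assumptions: ReLU activations, unit-norm inputs, and the standard closed forms of the arc-cosine kernels of degree 0 and 1. The function names and the `depth` parameter (number of recursion steps) are illustrative and not taken from the paper.

```python
import math


def kappa0(u):
    """Arc-cosine kernel of degree 0 for unit-norm inputs with inner product u."""
    u = max(-1.0, min(1.0, u))  # clamp to guard against rounding outside [-1, 1]
    return (math.pi - math.acos(u)) / math.pi


def kappa1(u):
    """Arc-cosine kernel of degree 1 for unit-norm inputs with inner product u."""
    u = max(-1.0, min(1.0, u))
    return (u * (math.pi - math.acos(u)) + math.sqrt(1.0 - u * u)) / math.pi


def relu_ntk(u, depth):
    """Evaluate the ReLU NTK recursion after `depth` steps (assumed form).

    sigma tracks the last-layer (arc-cosine) kernel, ntk the tangent kernel:
        ntk_k   = ntk_{k-1} * kappa0(sigma_{k-1}) + kappa1(sigma_{k-1})
        sigma_k = kappa1(sigma_{k-1})
    starting from sigma_0 = ntk_0 = u.
    """
    sigma = u
    ntk = u
    for _ in range(depth):
        ntk = ntk * kappa0(sigma) + kappa1(sigma)
        sigma = kappa1(sigma)
    return ntk


if __name__ == "__main__":
    for u in (-1.0, 0.0, 0.5, 1.0):
        print(u, relu_ntk(u, depth=1))
```

With `depth=1` this reduces to $u\,\kappa_0(u) + \kappa_1(u)$, the two-layer case; at $u = 1$ both arc-cosine kernels equal 1, so each step adds 1 to the kernel value.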

Figures (2)

  • Figure 1: Geometry of kernel mapping for CKN and NTK convolutional kernels, on digit images and their deformations from the Infinite MNIST dataset (Loosli et al., 2007). The curves show average relative distances of a single digit to its deformations, combinations of translations and deformations, digits of the same label, and digits of any label. See the appendix for more details on the experimental setup.
  • Figure 2: MNIST digits with transformations considered in our numerical study of stability. Each row gives examples of images from a set of digits that are compared to a reference image of a "5". From top to bottom: deformations with $\alpha = 3$; translations and deformations with $\alpha = 1$; digits from the training set with the same label "5" as the reference digit; digits from the training set with any label.
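
Distances such as those plotted in Figure 1 can be computed from kernel evaluations alone, without forming the feature map explicitly. The sketch below assumes that the relative distance is $\|\Phi(x') - \Phi(x)\| / \|\Phi(x)\|$, normalized by the reference image's norm; that normalization, the function names, and the linear kernel used as a stand-in are assumptions for illustration, not the paper's experimental code.

```python
import math


def relative_kernel_distance(kernel, x_ref, x_other):
    """Relative distance ||Phi(x_other) - Phi(x_ref)|| / ||Phi(x_ref)||.

    Uses only kernel evaluations:
        ||Phi(a) - Phi(b)||^2 = k(a, a) + k(b, b) - 2 k(a, b).
    The normalization by the reference norm is an assumption.
    """
    k_rr = kernel(x_ref, x_ref)
    k_oo = kernel(x_other, x_other)
    k_ro = kernel(x_ref, x_other)
    sq_dist = max(k_rr + k_oo - 2.0 * k_ro, 0.0)  # clip tiny negative rounding errors
    return math.sqrt(sq_dist) / math.sqrt(k_rr)


def linear_kernel(a, b):
    """Stand-in kernel for the demo; a CKN or NTK kernel would be plugged in here."""
    return sum(ai * bi for ai, bi in zip(a, b))


if __name__ == "__main__":
    reference = [3.0, 4.0]
    deformed = [3.0, 4.5]
    print(relative_kernel_distance(linear_kernel, reference, deformed))
```

Averaging this quantity over a set of deformed or same-label images gives one point on a curve of the kind described in the caption.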

Theorems & Definitions (29)

  • Lemma 1: NTK feature map for fully-connected network
  • Proposition 2: NTK feature map for CNN
  • Proposition 3: Non-Lipschitzness
  • Proposition 4: Smoothness for ReLU NTK
  • Proposition 5: Mercer decomposition of ReLU NTK
  • Corollary 6: Sufficient condition for $f \in {\mathcal{H}}$
  • Corollary 7: Approximation of Lipschitz functions
  • Proposition 8: RKHS of the homogeneous NTK
  • Proposition 9: Lipschitzness for smooth activations
  • Lemma 10: Smoothness of operator $M$
  • ...and 19 more