On the Inductive Bias of Neural Tangent Kernels
Alberto Bietti, Julien Mairal
TL;DR
The paper analyzes the inductive bias of gradient descent in the lazy, over-parameterized regime by studying the neural tangent kernel (NTK) of fully-connected and convolutional architectures. It derives a hierarchical, tree-structured NTK for CNNs with generic patch extraction and pooling operators, shows that the kernel mapping of ReLU NTKs is Hölder smooth rather than Lipschitz, and proves deformation stability bounds for the CNN kernel mapping. For two-layer ReLU networks, it gives a Mercer spectral decomposition whose eigenvalues decay more slowly than those of the arc-cosine kernel obtained by training only the last layer, which implies better approximation properties. Numerical experiments on image-like data compare the deformation stability of NTKs and kernel networks, highlighting a tradeoff between approximation power and stability and suggesting that regimes beyond pure lazy training could be advantageous. Overall, the work clarifies how the kernel induced by lazy training shapes which functions deep networks can learn and how well they generalize, with implications for architecture design and regularization.
Abstract
State-of-the-art neural networks are heavily over-parameterized, making the optimization algorithm a crucial ingredient for learning predictive models with good generalization properties. A recent line of work has shown that in a certain over-parameterized regime, the learning dynamics of gradient descent are governed by a certain kernel obtained at initialization, called the neural tangent kernel. We study the inductive bias of learning in such a regime by analyzing this kernel and the corresponding function space (RKHS). In particular, we study smoothness, approximation, and stability properties of functions with finite norm, including stability to image deformations in the case of convolutional networks, and compare to other known kernels for similar architectures.
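To make the object of study concrete, the sketch below evaluates the infinite-width NTK of a two-layer ReLU network in closed form. It assumes both layers are trained and a normalization under which the kernel of a unit vector with itself equals 2. Under those assumptions the kernel is K(x, y) = ||x|| ||y|| κ(u), where u is the cosine of the angle between x and y, κ(u) = u κ0(u) + κ1(u), and κ0, κ1 are the arc-cosine kernels of degree 0 and 1. The function names are hypothetical and chosen for illustration only.

```python
import numpy as np

def kappa0(u):
    # Arc-cosine kernel of degree 0, as a function of the cosine u in [-1, 1].
    u = np.clip(u, -1.0, 1.0)
    return (np.pi - np.arccos(u)) / np.pi

def kappa1(u):
    # Arc-cosine kernel of degree 1, as a function of the cosine u in [-1, 1].
    u = np.clip(u, -1.0, 1.0)
    return (u * (np.pi - np.arccos(u)) + np.sqrt(1.0 - u * u)) / np.pi

def relu_ntk_two_layer(x, y):
    # Infinite-width NTK of a two-layer ReLU network (assumed normalization:
    # K(x, x) = 2 for unit-norm x). The kernel is positively homogeneous
    # in each argument, so the norms factor out.
    nx, ny = np.linalg.norm(x), np.linalg.norm(y)
    if nx == 0.0 or ny == 0.0:
        return 0.0
    u = float(np.dot(x, y) / (nx * ny))
    return nx * ny * (u * kappa0(u) + kappa1(u))

if __name__ == "__main__":
    e1 = np.array([1.0, 0.0])
    e2 = np.array([0.0, 1.0])
    print(relu_ntk_two_layer(e1, e1))   # identical unit vectors
    print(relu_ntk_two_layer(e1, e2))   # orthogonal unit vectors
    print(relu_ntk_two_layer(e1, -e1))  # antipodal unit vectors
```

The term κ1 alone is the kernel obtained when only the last layer is trained. The additional term u κ0(u) comes from training the first layer and is what distinguishes the NTK from that arc-cosine kernel.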
