Towards a Statistical Understanding of Neural Networks: Beyond the Neural Tangent Kernel Theories

Haobo Zhang, Jianfa Lai, Yicheng Li, Qian Lin, Jun S. Liu

TL;DR

This work addresses the gap between NTK-based fixed-kernel analyses and the feature-learning capabilities of real neural networks within a nonparametric regression framework. It surveys kernel regression theory in fixed and high-dimensional regimes, clarifies limitations of fixed kernels, and introduces an adaptive feature paradigm together with an over-parameterized Gaussian sequence model to study feature learning. The key contributions include formalizing the adaptive feature model, establishing links to kernel regression and Gaussian sequence dynamics, and proposing a tractable prototype (the over-parameterized Gaussian sequence model) to analyze how learnable features influence generalization. The proposed paradigm provides a route to quantify when and how feature learning improves generalization, offering theoretical insight into the practical success of neural nets beyond the NTK regime.

Abstract

A primary advantage of neural networks lies in their feature learning characteristics, which are challenging to analyze theoretically because of the complexity of their training dynamics. We propose a new paradigm for studying feature learning and the resulting benefits in generalizability. After reviewing the neural tangent kernel (NTK) theory and recent results in kernel regression, which address the generalization of sufficiently wide neural networks, we examine the limitations and implications of fixed kernel theories (such as the NTK theory) and review recent theoretical advances in feature learning. Moving beyond the fixed kernel/feature theory, we consider neural networks as adaptive feature models. Finally, we propose an over-parameterized Gaussian sequence model as a prototype for studying the feature learning characteristics of neural networks.

Paper Structure

This paper contains 28 sections, 69 equations, 6 figures.

Figures (6)

  • Figure 1: Asymptotic learning curve of kernel gradient flow.
  • Figure 2: Best convergence rates of KRR and corresponding minimax lower rate (w.r.t. $d$) for $s=1.5$ and $\gamma>0$.
  • Figure 3: Phase diagram about the consistency and the optimality of kernel interpolation.
  • Figure 4: The inclusion relationships among adaptive feature theories, neural network theories, and kernel regression theories.
  • Figure 5: $x$-axis: number of training iterations; $y$-axis: the share of the first $p$ projections, $\sum_{j=1}^{p} f_{j}^{2} / \sum_{j=1}^{m} f_{j}^{2}$, for $p = 1, 100, 300$ and $m = 500$. The projections $f_{j}$ concentrate on the top eigenspaces as training proceeds.
  • ...and 1 more figures
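
The ratio in the Figure 5 caption can be computed directly from a vector of projection coefficients. The following is a minimal sketch, assuming the coefficients $f_j$ are already available as a plain list ordered by eigenvalue rank. The function name `top_energy_fraction` and the two synthetic coefficient vectors are hypothetical illustrations, not data from the paper.

```python
def top_energy_fraction(f, p):
    """Return sum_{j<=p} f_j^2 / sum_{j<=m} f_j^2 for coefficients f of length m."""
    if not 0 <= p <= len(f):
        raise ValueError("p must lie between 0 and len(f)")
    total = sum(c * c for c in f)
    if total == 0.0:
        raise ValueError("all projections are zero, so the ratio is undefined")
    return sum(c * c for c in f[:p]) / total


m = 500  # number of eigen-directions, matching the caption

# Hypothetical coefficients: energy spread evenly over all directions.
flat = [1.0] * m

# Hypothetical coefficients: energy concentrated on the leading directions.
decaying = [1.0 / (j + 1) for j in range(m)]

for p in (1, 100, 300):
    print(p, top_energy_fraction(flat, p), top_energy_fraction(decaying, p))
```

A curve that rises toward 1 for small $p$ corresponds to the concentration on top eigenspaces that the caption describes; the flat vector is included only as a contrast.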

Theorems & Definitions (4)

  • Definition 2.2.1: Kernel ridge regression, KRR
  • Definition 2.2.2: Kernel gradient flow, KGF
  • Definition 4.1.1: Adaptive feature model
  • Definition 4.3.1: Over-parameterized Gaussian sequence model
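
Definitions 2.2.1 and 2.2.2 are listed here by name only. As a rough sketch of how the two estimators differ, the standard spectral-filter view (a textbook fact, not quoted from the paper, whose notation may differ) says that along a kernel eigen-direction with eigenvalue $\lambda_j$, KRR with ridge parameter $\lambda$ scales the observed coefficient by $\lambda_j / (\lambda_j + \lambda)$, while KGF started from zero and run to time $t$ scales it by $1 - e^{-t \lambda_j}$. The function names and the eigenvalue sequence below are assumptions made for illustration.

```python
import math


def krr_filter(eig, lam):
    """Shrinkage applied by kernel ridge regression along an eigen-direction."""
    if eig < 0 or lam <= 0:
        raise ValueError("need eig >= 0 and lam > 0")
    return eig / (eig + lam)


def kgf_filter(eig, t):
    """Shrinkage applied by kernel gradient flow (zero init) after time t."""
    if eig < 0 or t < 0:
        raise ValueError("need eig >= 0 and t >= 0")
    # 1 - exp(-t * eig), computed with expm1 for accuracy when t * eig is tiny.
    return -math.expm1(-t * eig)


# Hypothetical polynomially decaying eigenvalues, lambda_j = j^(-2).
eigs = [j ** -2.0 for j in range(1, 11)]

lam = 0.01      # ridge parameter (assumed value)
t = 1.0 / lam   # stopping time chosen as 1 / lam for a like-for-like comparison

for eig in eigs:
    print(f"{eig:.4f}  KRR={krr_filter(eig, lam):.4f}  KGF={kgf_filter(eig, t):.4f}")
```

Both filters keep directions with large eigenvalues almost untouched and suppress those with small eigenvalues. The filter itself is fixed in advance by the kernel, which is the sense in which these are fixed kernel methods.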