Sharp Generalization for Nonparametric Regression in Interpolation Space by Over-Parameterized Neural Networks Trained with Preconditioned Gradient Descent and Early Stopping

Yingzhen Yang; Ping Li

TL;DR

The paper tackles nonparametric regression with over-parameterized two-layer neural networks trained by a novel Preconditioned Gradient Descent (PGD) and early stopping. It shows that PGD induces an integral kernel ${K}^{\mathrm{int}}$ with eigenvalues ${\lambda}^{(\mathrm{int})}_j=\lambda_j^{s+2}$, enabling learning beyond the linear NTK regime for target functions in the interpolation space ${[ {\cal H}_{K} ]}^{s'}$ with $s'\ge3$. Under spherical-uniform inputs, the authors prove a minimax-optimal regression rate ${O}\left(n^{-\frac{d s'}{d s'+d-1}}\right)$ (equivalently ${O}\left(n^{-\frac{2\alpha s'}{2\alpha s'+1}}\right)$ with $2\alpha=d/(d-1)$), and provide a decomposition of the network output into an ${\cal H}_{K}^{(\mathrm{int})}$ component plus a small $L^{\infty}$ residual, along with a local Rademacher-complexity analysis to tightly bound the risk. The results imply that PGD can escape the NTK linear regime and achieve faster, minimax-optimal rates. Simulations corroborate the theory, showing improved generalization and meaningful early-stopping behavior. Overall, the work advances understanding of algorithmic guarantees in deep learning for nonparametric regression and highlights a principled route to sharper generalization via kernel learning beyond NTK.
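The two forms of the rate quoted above agree once $2\alpha = d/(d-1)$ is substituted into the exponent. A short check, using only the symbols already defined in the summary:

$$
\frac{2\alpha s'}{2\alpha s'+1}
= \frac{\frac{d s'}{d-1}}{\frac{d s'}{d-1}+1}
= \frac{\frac{d s'}{d-1}}{\frac{d s'+d-1}{d-1}}
= \frac{d s'}{d s'+d-1}.
$$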

Abstract

We study nonparametric regression using over-parameterized two-layer neural networks trained with algorithmic guarantees. We consider the setting where the training features are drawn uniformly from the unit sphere in $\mathbb{R}^d$, and the target function lies in an interpolation space commonly studied in statistical learning theory. We demonstrate that training the neural network with a novel Preconditioned Gradient Descent (PGD) algorithm, equipped with early stopping, achieves a sharp regression rate of ${O}\left(n^{-\frac{2\alpha s'}{2\alpha s'+1}}\right)$ when the target function is in the interpolation space ${[ {\cal H}_{K} ]}^{s'}$ with $s' \ge 3$. This rate is even sharper than the currently known nearly-optimal rate of ${O}\left(n^{-\frac{2\alpha s'}{2\alpha s'+1}}\right)\log^2(1/\delta)$~\citep{Li2024-edr-general-domain}, where $n$ is the size of the training data and $\delta \in (0,1)$ is a small probability. This rate is also sharper than the standard kernel regression rate of ${O}\left(n^{-\frac{2\alpha}{2\alpha+1}}\right)$ obtained under the regular Neural Tangent Kernel (NTK) regime when training the neural network with vanilla gradient descent (GD), where $2\alpha = d/(d-1)$. Our analysis is based on two key technical contributions. First, we present a principled decomposition of the network output at each PGD step into a function in the reproducing kernel Hilbert space (RKHS) of a newly induced integral kernel, and a residual function with small $L^{\infty}$-norm. Second, leveraging this decomposition, we apply local Rademacher complexity theory to tightly control the complexity of the function class comprising all the neural network functions obtained in the PGD iterates. Our results further suggest that PGD enables the neural network to escape the linear NTK regime and achieve improved generalization.
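To make the comparison between the two rates concrete, the following minimal sketch evaluates only the exponents stated in the abstract, $\frac{2\alpha}{2\alpha+1}$ for the NTK regime and $\frac{2\alpha s'}{2\alpha s'+1}$ for PGD, with $2\alpha = d/(d-1)$. The function name `rate_exponents` is hypothetical and not from the paper; constants and logarithmic factors are ignored.

```python
from fractions import Fraction


def rate_exponents(d, s_prime):
    """Return (ntk_exponent, pgd_exponent) for the rate n^(-exponent).

    Illustrative helper only (not from the paper). It evaluates the
    exponents quoted in the abstract with 2*alpha = d / (d - 1).
    """
    if d < 2:
        raise ValueError("d must be at least 2 so that d / (d - 1) is defined")
    two_alpha = Fraction(d, d - 1)
    # Standard kernel regression rate under the NTK regime.
    ntk = two_alpha / (two_alpha + 1)
    # Rate stated for PGD with early stopping, target in [H_K]^{s'}.
    pgd = (two_alpha * s_prime) / (two_alpha * s_prime + 1)
    return ntk, pgd


if __name__ == "__main__":
    for d in (2, 3, 10):
        ntk, pgd = rate_exponents(d, s_prime=3)
        print(f"d={d}: NTK exponent {ntk}, PGD exponent {pgd}")
```

Exact fractions are used so the output can be compared directly with the closed form $\frac{d s'}{d s'+d-1}$; a larger exponent means a faster decay of the risk bound in $n$.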

Paper Structure (22 sections, 22 theorems, 7 equations, 3 figures, 1 table, 2 algorithms)

This paper contains 22 sections, 22 theorems, 7 equations, 3 figures, 1 table, 2 algorithms.

Introduction
Problem Setup
Two-Layer Neural Network
Kernel and Target Function
Summary of Main Results
Training by Gradient Descent and Preconditioned Gradient Descent
Main Results
Kernel Complexity
Nonparametric Regression for Target Function with Spectral Bias
Significance of Theorem \ref{theorem:minimax-nonparametric-regression-Kint} and Its Proof
Roadmap of Proofs
Uniform Convergence to the NTK and More
Basic Definitions
Detailed Roadmap and Key Technical Results
Difference from Existing Kernel Learning Theory
...and 7 more sections

Key Result

Theorem 5.1

Let $s \ge 1$, $\alpha = d/(2(d-1))$, $c_T, c_t \in (0,1]$ be positive constants, and $c_T \widehat{T} \le T \le \widehat{T}$. Suppose $\delta \in (0,1)$, $m \gtrsim n^{\frac{25\alpha(s+2)}{2\alpha(s+2)+1}} d^{\frac{5}{2}}$, $N \gtrsim n^{\frac{8\alpha(s+2)}{2\alpha(s+2)+1}} (n/\delta)$, and the neural network $f_t = f(\mathbf{W}(t),\cdot)$ is trained by PGD in Algori

Figures (3)

Figure 1: Left: illustration of the test loss by GD and PGD for varying $n$ in $[100,1000]$ with a step size of $100$. The shaded area in each plot indicates the standard deviation across $10$ random initializations of the neural network. Right: illustration of the ratio of early stopping time.
Figure 2: Illustration of the test loss by PGD, averaged over $10$ random initializations of the neural network.
Figure 3: Roadmap of Major Results Leading to Theorem \ref{theorem:minimax-nonparametric-regression-Kint}.

Theorems & Definitions (25)

Theorem 5.1
Theorem 6.1 \citep{Yang2025-generalization-two-layer-regression,yang2024gradientdescentfindsoverparameterized}
Theorem 6.2
Lemma 6.3
Theorem 6.4
Theorem 6.5
Lemma 6.6
Definition 1.1
Theorem 1.1 \citep{bartlett2005}
Theorem 1.2
...and 15 more
