Sharp Generalization for Nonparametric Regression in Interpolation Space by Over-Parameterized Neural Networks Trained with Preconditioned Gradient Descent and Early Stopping

Yingzhen Yang, Ping Li

TL;DR

The paper tackles nonparametric regression with over-parameterized two-layer neural networks trained by a novel Preconditioned Gradient Descent (PGD) and early stopping. It shows that PGD induces an integral kernel ${K}^{\mathrm{int}}$ with eigenvalues ${\lambda}^{(\mathrm{int})}_j=\lambda_j^{s+2}$, enabling learning beyond the linear NTK regime for target functions in the interpolation space ${[ {\cal H}_{K} ]}^{s'}$ with $s'\ge3$. Under spherical-uniform inputs, the authors prove a minimax-optimal regression rate ${O}\left(n^{-\frac{d s'}{d s'+d-1}}\right)$ (equivalently ${O}\left(n^{-\frac{2\alpha s'}{2\alpha s'+1}}\right)$ with $2\alpha=d/(d-1)$), and provide a decomposition of the network output into an ${\cal H}_{K}^{(\mathrm{int})}$ component plus a small $L^{\infty}$ residual, along with a local Rademacher-complexity analysis to tightly bound the risk. The results imply that PGD can escape the NTK linear regime and achieve faster, minimax-optimal rates. Simulations corroborate the theory, showing improved generalization and meaningful early-stopping behavior. Overall, the work advances understanding of algorithmic guarantees in deep learning for nonparametric regression and highlights a principled route to sharper generalization via kernel learning beyond NTK.
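
As a quick check of the equivalence between the two forms of the rate quoted above, substituting $2\alpha = d/(d-1)$ into the exponent gives the following (a worked step added for illustration, using only the symbols already defined in the summary):

$$
\frac{2\alpha s'}{2\alpha s'+1}
= \frac{\frac{d s'}{d-1}}{\frac{d s'}{d-1}+1}
= \frac{\frac{d s'}{d-1}}{\frac{d s' + d - 1}{d-1}}
= \frac{d s'}{d s' + d - 1}.
$$

The same substitution with $s'=1$ gives the exponent $\frac{2\alpha}{2\alpha+1} = \frac{d}{2d-1}$ of the standard kernel regression rate, so any $s' > 1$ yields a strictly larger exponent and hence a faster rate.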

Abstract

We study nonparametric regression using over-parameterized two-layer neural networks trained with algorithmic guarantees. We consider the setting where the training features are drawn uniformly from the unit sphere in $\mathbb{R}^d$, and the target function lies in an interpolation space commonly studied in statistical learning theory. We demonstrate that training the neural network with a novel Preconditioned Gradient Descent (PGD) algorithm, equipped with early stopping, achieves a sharp regression rate of ${O}\left(n^{-\frac{2\alpha s'}{2\alpha s'+1}}\right)$ when the target function is in the interpolation space ${[ {\cal H}_{K} ]}^{s'}$ with $s' \ge 3$. This rate is even sharper than the currently known nearly-optimal rate of ${O}\left(n^{-\frac{2\alpha s'}{2\alpha s'+1}}\right)\log^2(1/\delta)$ \citep{Li2024-edr-general-domain}, where $n$ is the size of the training data and $\delta \in (0,1)$ is a small probability. It is also sharper than the standard kernel regression rate of ${O}\left(n^{-\frac{2\alpha}{2\alpha+1}}\right)$ obtained under the regular Neural Tangent Kernel (NTK) regime when the neural network is trained with vanilla gradient descent (GD), where $2\alpha = d/(d-1)$. Our analysis is based on two key technical contributions. First, we present a principled decomposition of the network output at each PGD step into a function in the reproducing kernel Hilbert space (RKHS) of a newly induced integral kernel, and a residual function with small $L^{\infty}$-norm. Second, leveraging this decomposition, we apply local Rademacher complexity theory to tightly control the complexity of the function class comprising all the neural network functions obtained in the PGD iterates. Our results further suggest that PGD enables the neural network to escape the linear NTK regime and achieve improved generalization.
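
To make the comparison between the two rates concrete, the exponents can be evaluated for specific values of $d$ and $s'$. The snippet below is a minimal sketch; the function names `two_alpha`, `ntk_exponent`, and `pgd_exponent` are hypothetical helpers introduced here for illustration, and only the formulas $2\alpha = d/(d-1)$, $\frac{2\alpha}{2\alpha+1}$, and $\frac{2\alpha s'}{2\alpha s'+1}$ come from the abstract.

```python
from fractions import Fraction


def two_alpha(d):
    """Return 2*alpha = d / (d - 1) as an exact fraction (requires d >= 2)."""
    if d < 2:
        raise ValueError("d must be at least 2")
    return Fraction(d, d - 1)


def ntk_exponent(d):
    """Exponent of the standard kernel regression rate n^{-2a/(2a+1)}."""
    a2 = two_alpha(d)
    return a2 / (a2 + 1)


def pgd_exponent(d, s_prime):
    """Exponent of the rate n^{-2a s'/(2a s'+1)} stated for PGD."""
    a2s = two_alpha(d) * s_prime
    return a2s / (a2s + 1)


# Example: inputs on the unit sphere in R^3, target smoothness s' = 3.
print(ntk_exponent(3))     # 3/5
print(pgd_exponent(3, 3))  # 9/11
```

Exact fractions are used so that the result can be compared directly with the closed form $\frac{d s'}{d s' + d - 1}$; for $d=3$, $s'=3$ this is $9/11$, against $3/5$ for the NTK-regime rate.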

Paper Structure (22 sections, 22 theorems, 7 equations, 3 figures, 1 table, 2 algorithms)

Key Result

Theorem 5.1

Let $s \ge 1$, $\alpha = d/(2(d-1))$, $c_T, c_t \in (0,1]$ be positive constants, and $c_T \widehat{T} \le T \le \widehat{T}$. Suppose $\delta \in (0,1)$, $m \gtrsim n^{\frac{25\alpha(s+2)}{2\alpha(s+2)+1}} d^{\frac{5}{2}}$, $N \gtrsim n^{\frac{8\alpha(s+2)}{2\alpha(s+2)+1}} (n/\delta)$, and the neural network $f_t = f(\mathbf{W}(t),\cdot)$ is trained by PGD in Algori…
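
The theorem concerns a model trained by PGD and stopped at a step $T$ that is no later than $\widehat{T}$. The general shape of such a procedure can be sketched as follows. This is not the paper's algorithm: it is a generic preconditioned gradient descent on a linear least-squares model, with a fixed preconditioner `P` chosen by assumption, and it substitutes a validation-based patience rule for the theoretically defined stopping time. All names (`pgd_early_stopping`, `P`, `patience`) are hypothetical.

```python
import numpy as np


def pgd_early_stopping(X_tr, y_tr, X_val, y_val, P, lr=0.5,
                       max_steps=500, patience=10):
    """Generic preconditioned GD on least squares with early stopping.

    Update rule: w <- w - lr * P @ grad, where P is a fixed matrix.
    Stops once the validation loss has not improved for `patience` steps.
    """
    w = np.zeros(X_tr.shape[1])
    best_w = w.copy()
    best_val = np.mean((X_val @ w - y_val) ** 2)
    best_step = 0
    for t in range(1, max_steps + 1):
        # Gradient of the training objective 0.5 * mean squared error.
        grad = X_tr.T @ (X_tr @ w - y_tr) / len(y_tr)
        w = w - lr * (P @ grad)  # preconditioned step
        val = np.mean((X_val @ w - y_val) ** 2)
        if val < best_val:
            best_w, best_val, best_step = w.copy(), val, t
        elif t - best_step >= patience:
            break  # early stopping
    return best_w, best_val, best_step


# Synthetic data: features normalized onto the unit sphere in R^5.
rng = np.random.default_rng(0)
d = 5
X = rng.normal(size=(200, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=200)
X_tr, y_tr, X_val, y_val = X[:150], y[:150], X[150:], y[150:]

# Assumed preconditioner: regularized inverse of the empirical covariance.
P = np.linalg.inv(X_tr.T @ X_tr / len(y_tr) + 1e-3 * np.eye(d))
w_hat, val_loss, stop_step = pgd_early_stopping(X_tr, y_tr, X_val, y_val, P)
```

The function returns the iterate with the lowest validation loss together with the step at which it was reached, which plays the role of an empirical stopping time in this toy setting.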

Figures (3)

  • Figure 1: Left: illustration of the test loss by GD and PGD for varying $n$ in $[100,1000]$ with a step size of $100$. The shaded area in each plot indicates the standard deviation across $10$ random initializations of the neural network. Right: illustration of the ratio of early stopping time.
  • Figure 2: Illustration of the test loss by PGD, averaged over $10$ random initializations of the neural network.
  • Figure 3: Roadmap of Major Results Leading to Theorem \ref{theorem:minimax-nonparametric-regression-Kint}.

Theorems & Definitions (25)

  • Theorem 5.1
  • Theorem 6.1: Yang2025-generalization-two-layer-regression,yang2024gradientdescentfindsoverparameterized
  • Theorem 6.2
  • Lemma 6.3
  • Theorem 6.4
  • Theorem 6.5
  • Lemma 6.6
  • Definition 1.1
  • Theorem 1.1: bartlett2005
  • Theorem 1.2
  • ...and 15 more