Optimal Rate of Kernel Regression in Large Dimensions

Weihao Lu; Haobo Zhang; Yicheng Li; Manyun Xu; Qian Lin

TL;DR

This paper addresses kernel regression in large dimensions with $n \asymp d^{\gamma}$, introducing a framework based on Mendelson complexity $\varepsilon_n^{2}$ and metric entropy $\bar{\varepsilon}_n^{2}$ to tightly bound the excess risk. For inner-product kernels on the sphere and targets in the associated RKHS $\mathcal{H}^{\text{in}}$, it proves minimax rates of $n^{-1/2}$ for $\gamma=2,4,6,\dots$ and derives a complete rate curve for all $\gamma>0$, revealing multiple-descent and periodic-plateau phenomena. The results extend to neural-tangent kernels (NTK) and thus to wide neural networks, offering explicit rate descriptions and corollaries that link kernel methods with deep-network generalization in high dimensions. Overall, the work provides a sharp, geometry-informed understanding of kernel regression in large dimensions and suggests new directions for analyzing generalization in wide networks.

Abstract

We study kernel regression for large-dimensional data, where the sample size $n$ depends polynomially on the dimension $d$ of the samples, i.e., $n \asymp d^{\gamma}$ for some $\gamma > 0$. We first build a general tool to characterize the upper bound and the minimax lower bound of kernel regression for large-dimensional data through the Mendelson complexity $\varepsilon_{n}^{2}$ and the metric entropy $\bar{\varepsilon}_{n}^{2}$, respectively. When the target function falls into the RKHS associated with a (general) inner product model defined on $\mathbb{S}^{d}$, we use the new tool to show that the minimax rate of the excess risk of kernel regression is $n^{-1/2}$ when $n \asymp d^{\gamma}$ for $\gamma = 2, 4, 6, 8, \cdots$. We then determine the optimal rate of the excess risk of kernel regression for all $\gamma > 0$ and find that the curve of the optimal rate as a function of $\gamma$ exhibits several new phenomena, including multiple descent behavior and periodic plateau behavior. As an application, we provide a similar explicit description of the curve of the optimal rate for the neural tangent kernel (NTK). As a direct corollary, these claims hold for wide neural networks as well.
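To make the setting concrete, the following is a minimal sketch of the kind of experiment the abstract describes: inputs on the sphere $\mathbb{S}^{d}$, a sample size tied to the dimension through $n \asymp d^{\gamma}$, and a target inside the RKHS of an inner-product kernel. It is not the authors' implementation. The kernel $\exp(\langle x, x' \rangle)$, the noise level, the ridge parameter `lam`, and the use of kernel ridge regression as the estimator are all assumptions made for illustration, as are the helper names.

```python
import numpy as np

def sample_sphere(n, d, rng):
    """Draw n points uniformly from the unit sphere S^d embedded in R^(d+1)."""
    x = rng.standard_normal((n, d + 1))
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def inner_product_kernel(a, b):
    """Illustrative inner-product kernel k(x, x') = exp(<x, x'>) (an assumption)."""
    return np.exp(a @ b.T)

def fit_krr(x_train, y_train, lam):
    """Return a kernel ridge regression predictor with ridge parameter lam."""
    n = x_train.shape[0]
    gram = inner_product_kernel(x_train, x_train)
    alpha = np.linalg.solve(gram + n * lam * np.eye(n), y_train)
    return lambda x_new: inner_product_kernel(x_new, x_train) @ alpha

rng = np.random.default_rng(0)
d, gamma = 20, 1.5
n = int(d ** gamma)  # sample size tied to the dimension, n ~ d^gamma

x_train = sample_sphere(n, d, rng)
anchor = sample_sphere(1, d, rng)

def f_star(x):
    """Hypothetical target k(., anchor), which lies in the RKHS by construction."""
    return inner_product_kernel(x, anchor)[:, 0]

y_train = f_star(x_train) + 0.1 * rng.standard_normal(n)  # assumed noise level
predict = fit_krr(x_train, y_train, lam=1e-3)

# Monte Carlo estimate of the excess risk E[(f_hat(x) - f_star(x))^2].
x_test = sample_sphere(2000, d, rng)
excess_risk = float(np.mean((predict(x_test) - f_star(x_test)) ** 2))
```

Repeating this over a grid of $d$ with $\gamma$ held fixed gives an empirical risk curve that could be compared against the rates stated above, though the constants and the choice of regularization would need tuning.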

Paper Structure (46 sections, 59 theorems, 251 equations, 4 figures)


Introduction
Related works
Our contribution
Notations
Preliminaries
Warm-ups: optimality of kernel regression with inner product kernels in large dimensions for $\gamma=2,4,6,\cdots$
Main results: optimality of kernel regression in large dimensions for all $\gamma>0$
Applications in Wide Neural Network
Bounds for large dimensional kernel regression
What Can We Expect from Kernel Regression for Large Dimensional Data
Consistency of kernel regression when $n \asymp d^{\gamma}$, $\gamma > 0$
Kernel regressions generalize better than kernel interpolation in large dimensions
Numerical Experiments
Conclusion and Future Works
Proof of Theorems in Section \ref{sec:main_results}
...and 31 more sections

Key Result

Lemma 3.2

Suppose that Assumptions \ref{assu:trace_class}-\ref{assu:coef_of_inner_prod_kernel} hold. Suppose that $p \geq 0$ is any integer. There exist positive constants $\mathfrak{C}_1$, $\mathfrak{C}_2$, $\mathfrak{C}_3$, and $\mathfrak{C}_4$, such that for any $d \geq \mathfrak{C}$, we have

Figures (4)

Figure 1: A graphical representation of the minimax optimal rate of the excess risk of kernel regression with inner product kernels obtained from Theorem \ref{thm:near_lower_inner_large_d} and Theorem \ref{thm:near_upper_inner_large_d}. The solid black line represents the upper bound that matches the minimax lower bound up to a constant factor. The dashed blue line indicates that, for any $\epsilon > 0$, the ratio between the upper and lower bounds differs by at most $n^{-\epsilon}$.
Figure 2: (a) A cartoon of the excess risk of kernel ridge regression when $f_{\star}$ is square-integrable. Borrowed from ghorbani2021linearized. (b) The excess risk of early-stopping kernel regression when $f_{\star} \in \mathcal{H}^{\mathtt{in}}$. Obtained from Theorem \ref{thm:near_lower_inner_large_d} and Theorem \ref{thm:near_upper_inner_large_d}. (c) The excess risk of kernel interpolation when $f_{\star} \in \mathcal{H}^{\mathtt{in}}$. Obtained from results in liang2020multiple.
Figure 3: Log excess risk decay curves of kernel regression and kernel interpolation with NTK under different asymptotic frameworks $n \asymp d^\gamma$. The blue curves represent the average excess risks computed from 20 trials. The dashed black lines are obtained through logarithmic least-squares regression, with the slopes indicating the convergence rates denoted as $r$. The four sub-figures, from left to right and from top to bottom, correspond to the settings where $n$ is set equal to $d^{0.5}$, $d^{0.8}$, $d^{1.5}$, and $d^{1.8}$, respectively. In each setting, the constant $C$ is chosen from $\{0.001, 0.01, 0.1, 1, 10, 100, 1000\}$, and we report our numerical results under the best choice of $C$.
Figure 4: A plot similar to Figure \ref{fig:3_1}, but with the RBF kernel.
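
The caption of Figure 3 describes estimating a convergence rate as the slope of a least-squares line fitted to the log excess risk. A minimal sketch of that fit is given below, assuming the regression is of log risk on log sample size. The function name `fit_loglog_slope` and the synthetic risks are hypothetical, and whether the paper reports the slope itself or its negation as $r$ is not stated here, so the sketch returns the raw slope.

```python
import numpy as np

def fit_loglog_slope(sample_sizes, excess_risks):
    """Least-squares fit of log(risk) = intercept + slope * log(n).

    Returns (slope, intercept). If risk ~ c * n^(-r), the slope equals -r.
    """
    log_n = np.log(np.asarray(sample_sizes, dtype=float))
    log_risk = np.log(np.asarray(excess_risks, dtype=float))
    slope, intercept = np.polyfit(log_n, log_risk, deg=1)
    return float(slope), float(intercept)

# Synthetic check (not data from the paper): risks decaying exactly as 3 * n^(-1/2).
ns = np.array([100.0, 200.0, 400.0, 800.0, 1600.0])
risks = 3.0 * ns ** -0.5
slope, intercept = fit_loglog_slope(ns, risks)
```

On noisy risk estimates, averaging over trials before taking logarithms (as the caption does with 20 trials) keeps the fitted slope from being dominated by a single run.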

Theorems & Definitions (83)

Remark 2.1
Remark 3.1
Lemma 3.2
Theorem 3.3: Upper bound
Lemma 3.4
Theorem 3.5: Minimax lower bound
Lemma 4.1
Theorem 4.2
Theorem 4.3
Remark 4.4
...and 73 more
