
Optimal Rate of Kernel Regression in Large Dimensions

Weihao Lu, Haobo Zhang, Yicheng Li, Manyun Xu, Qian Lin

TL;DR

This paper addresses kernel regression in large dimensions with $n \asymp d^{\gamma}$, introducing a framework based on Mendelson complexity $\varepsilon_n^{2}$ and metric entropy $\bar{\varepsilon}_n^{2}$ to tightly bound the excess risk. For inner-product kernels on the sphere and targets in the associated RKHS $\mathcal{H}^{\text{in}}$, it proves minimax rates of $n^{-1/2}$ for $\gamma=2,4,6,\dots$ and derives a complete rate curve for all $\gamma>0$, revealing multiple-descent and periodic-plateau phenomena. The results extend to neural-tangent kernels (NTK) and thus to wide neural networks, offering explicit rate descriptions and corollaries that link kernel methods with deep-network generalization in high dimensions. Overall, the work provides a sharp, geometry-informed understanding of kernel regression in large dimensions and suggests new directions for analyzing generalization in wide networks.

Abstract

We study kernel regression for large-dimensional data, where the sample size $n$ depends polynomially on the dimension $d$ of the samples, i.e., $n \asymp d^{\gamma}$ for some $\gamma > 0$. We first build a general tool to characterize the upper bound and the minimax lower bound of kernel regression for large-dimensional data through the Mendelson complexity $\varepsilon_{n}^{2}$ and the metric entropy $\bar{\varepsilon}_{n}^{2}$, respectively. When the target function falls into the RKHS associated with a (general) inner product model defined on $\mathbb{S}^{d}$, we use the new tool to show that the minimax rate of the excess risk of kernel regression is $n^{-1/2}$ when $n \asymp d^{\gamma}$ for $\gamma = 2, 4, 6, 8, \cdots$. We then determine the optimal rate of the excess risk of kernel regression for all $\gamma > 0$ and find that the curve of the optimal rate as a function of $\gamma$ exhibits several new phenomena, including multiple descent behavior and periodic plateau behavior. As an application, for the neural tangent kernel (NTK), we also provide a similar explicit description of the curve of the optimal rate. As a direct corollary, these claims hold for wide neural networks as well.

Paper Structure (46 sections, 59 theorems, 251 equations, 4 figures)


Key Result

Lemma 3.2

Suppose that Assumptions \ref{assu:trace_class} through \ref{assu:coef_of_inner_prod_kernel} hold. Suppose that $p \geq 0$ is any integer. There exist positive constants $\mathfrak{C}_1$, $\mathfrak{C}_2$, $\mathfrak{C}_3$, and $\mathfrak{C}_4$, such that for any $d \geq \mathfrak{C}$, we have

Figures (4)

  • Figure 1: A graphical representation of the minimax optimal rate of the excess risk of kernel regression with inner product kernels, obtained from Theorem \ref{thm:near_lower_inner_large_d} and Theorem \ref{thm:near_upper_inner_large_d}. The solid black line represents the upper bound that matches the minimax lower bound up to a constant factor. The dashed blue line indicates that, for any $\epsilon > 0$, the ratio between the upper and lower bounds differs by at most $n^{-\epsilon}$.
  • Figure 2: (a) A cartoon of the excess risk of kernel ridge regression when $f_{\star}$ is square-integrable. Borrowed from \cite{ghorbani2021linearized}. (b) The excess risk of early-stopping kernel regression when $f_{\star} \in \mathcal{H}^{\mathtt{in}}$. Obtained from Theorem \ref{thm:near_lower_inner_large_d} and Theorem \ref{thm:near_upper_inner_large_d}. (c) The excess risk of kernel interpolation when $f_{\star} \in \mathcal{H}^{\mathtt{in}}$. Obtained from results in \cite{liang2020multiple}.
  • Figure 3: Log excess risk decay curves of kernel regression and kernel interpolation with NTK under different asymptotic frameworks $n \asymp d^\gamma$. The blue curves represent the average excess risks computed from 20 trials. The dashed black lines are obtained through logarithmic least-squares regression, with the slopes indicating the convergence rates denoted as $r$. The four sub-figures from left to right and from top to bottom correspond to the settings where $n$ is set to be equal to $d^{0.5}$, $d^{0.8}$, $d^{1.5}$, and $d^{1.8}$ respectively. In each setting, the constant $C$ is chosen from $\{0.001, 0.01, 0.1, 1, 10, 100, 1000\}$, and we report our numerical results under the best choice of $C$.
  • Figure 4: A plot similar to Figure \ref{fig:3_1}, but with the RBF kernel.

Theorems & Definitions (83)

  • Remark 2.1
  • Remark 3.1
  • Lemma 3.2
  • Theorem 3.3: Upper bound
  • Lemma 3.4
  • Theorem 3.5: Minimax lower bound
  • Lemma 4.1
  • Theorem 4.2
  • Theorem 4.3
  • Remark 4.4
  • ...and 73 more