Optimal Rate of Kernel Regression in Large Dimensions
Weihao Lu, Haobo Zhang, Yicheng Li, Manyun Xu, Qian Lin
TL;DR
This paper studies kernel regression in large dimensions, where $n \asymp d^{\gamma}$. It introduces a framework that bounds the excess risk from above through the Mendelson complexity $\varepsilon_n^{2}$ and from below, in the minimax sense, through the metric entropy $\bar{\varepsilon}_n^{2}$. For inner-product kernels on the sphere and targets in the associated RKHS $\mathcal{H}^{\text{in}}$, it proves a minimax rate of $n^{-1/2}$ for $\gamma=2,4,6,\dots$ and derives the optimal rate curve for all $\gamma>0$, which exhibits multiple-descent and periodic-plateau behavior. The results extend to the neural tangent kernel (NTK) and, as a corollary, to wide neural networks, giving an explicit description of the optimal rate in that setting as well. Overall, the work gives a sharp characterization of kernel regression rates in large dimensions and suggests new directions for analyzing generalization in wide networks.
Abstract
We study kernel regression for large-dimensional data, where the sample size $n$ depends polynomially on the dimension $d$ of the samples, i.e., $n\asymp d^{\gamma}$ for some $\gamma>0$. We first build a general tool to characterize the upper bound and the minimax lower bound of kernel regression for large-dimensional data through the Mendelson complexity $\varepsilon_{n}^{2}$ and the metric entropy $\bar{\varepsilon}_{n}^{2}$, respectively. When the target function falls into the RKHS associated with a (general) inner product model defined on $\mathbb{S}^{d}$, we use the new tool to show that the minimax rate of the excess risk of kernel regression is $n^{-1/2}$ when $n\asymp d^{\gamma}$ for $\gamma=2, 4, 6, 8, \cdots$. We then determine the optimal rate of the excess risk of kernel regression for all $\gamma>0$ and find that the curve of the optimal rate as a function of $\gamma$ exhibits several new phenomena, including multiple descent behavior and periodic plateau behavior. As an application, we provide a similarly explicit description of the optimal rate curve for the neural tangent kernel (NTK). As a direct corollary, these claims hold for wide neural networks as well.
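To make the setting concrete, the following is a minimal sketch of kernel regression with an inner-product kernel on $\mathbb{S}^{d}$ in the regime $n\asymp d^{\gamma}$. Several choices are assumptions made for illustration only and are not taken from the paper: the kernel $k(x,z)=\exp(\langle x,z\rangle)$, the use of kernel ridge regression as the estimator, the regularization level `lam`, the noise level, and the hypothetical target $f^{*}(x)=k(x,u)$, which lies in the RKHS of the chosen kernel. The sketch only estimates an excess risk empirically for one pair $(d,\gamma)$ and does not reproduce the rates stated above.

```python
import numpy as np


def sample_sphere(n, d, rng):
    """Draw n points uniformly from the unit sphere S^d, embedded in R^(d+1)."""
    x = rng.standard_normal((n, d + 1))
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def inner_product_kernel(X, Z):
    """Assumed illustrative inner-product kernel: k(x, z) = exp(<x, z>)."""
    return np.exp(X @ Z.T)


def krr_fit_predict(X_train, y_train, X_test, lam):
    """Kernel ridge regression: solve (K + n*lam*I) alpha = y, then predict."""
    n = X_train.shape[0]
    K = inner_product_kernel(X_train, X_train)
    alpha = np.linalg.solve(K + n * lam * np.eye(n), y_train)
    return inner_product_kernel(X_test, X_train) @ alpha


rng = np.random.default_rng(0)

# Large-dimensional regime: sample size polynomial in the dimension.
d, gamma = 10, 2.0
n = int(round(d ** gamma))

X = sample_sphere(n, d, rng)
X_test = sample_sphere(500, d, rng)

# Hypothetical target in the RKHS of the assumed kernel: f*(x) = k(x, u).
u = sample_sphere(1, d, rng)


def f_star(A):
    return inner_product_kernel(A, u).ravel()


# Noisy observations; the noise level 0.1 is an arbitrary choice.
y = f_star(X) + 0.1 * rng.standard_normal(n)

pred = krr_fit_predict(X, y, X_test, lam=1e-3)

# Monte Carlo estimate of the excess risk E[(f_hat(x) - f*(x))^2].
excess_risk = float(np.mean((pred - f_star(X_test)) ** 2))
print(f"d={d}, gamma={gamma}, n={n}, estimated excess risk={excess_risk:.4f}")
```

Repeating this over a grid of $d$ and $\gamma$ values, and averaging over random seeds, would give an empirical curve that can be compared against a theoretical rate, subject to the assumptions listed above.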
