Generalization in Kernel Regression Under Realistic Assumptions

Daniel Barzilai, Ohad Shamir

TL;DR

This work develops a unified theory of generalization in kernel regression under mild, realistic assumptions. It introduces relative perturbation bounds for the eigenvalues of kernel matrices, which reveal a self-regularization effect: a heavy tail in the kernel's spectrum acts as an implicit form of regularization. The resulting upper bound on the excess risk holds for any amount of regularization and noise, any input dimension, and any number of samples. Applied to common kernels, including NTKs, the bound implies benign overfitting in high input dimensions, nearly tempered overfitting in fixed dimensions, and explicit convergence rates for regularized regression. As a by-product, it yields time-dependent bounds for neural networks trained in the kernel regime.

Abstract

It is by now well-established that modern over-parameterized models seem to elude the bias-variance tradeoff and generalize well despite overfitting noise. Many recent works attempt to analyze this phenomenon in the relatively tractable setting of kernel regression. However, as we argue in detail, most past works on this topic either make unrealistic assumptions, or focus on a narrow problem setup. This work aims to provide a unified theory to upper bound the excess risk of kernel regression for nearly all common and realistic settings. Specifically, we provide rigorous bounds that hold for common kernels and for any amount of regularization, noise, any input dimension, and any number of samples. Furthermore, we provide relative perturbation bounds for the eigenvalues of kernel matrices, which may be of independent interest. These reveal a self-regularization phenomenon, whereby a heavy tail in the eigendecomposition of the kernel provides it with an implicit form of regularization, enabling good generalization. When applied to common kernels, our results imply benign overfitting in high input dimensions, nearly tempered overfitting in fixed dimensions, and explicit convergence rates for regularized regression. As a by-product, we obtain time-dependent bounds for neural networks trained in the kernel regime.

Paper Structure (29 sections, 29 theorems, 182 equations, 2 figures)

Key Result

Theorem 1

Suppose Assumption (good_beta) holds, and that the eigenvalues of $\Sigma$ are given in non-increasing order $\lambda_1 \geq \lambda_2 \geq \ldots$. There exist absolute constants $c, C, c_1, c_2 > 0$ such that for any $k \leq k' \in [n]$ and $\delta > 0$, it holds with probability at least $1 - \delta - 4\,\frac{\cdots}{\cdots}$ that [...], where $\mathbb{I}_{k,n} = \cdots$.

Figures (2)

  • Figure 1: Variance of unregularized kernel regression, measured by the MSE for learning the constant $0$ function with noise $\epsilon_i \sim \mathcal{N}(0, 1)$ and inputs distributed uniformly on $\mathbb{S}^{d-1}$ ($\log$-$\log$ scale). Left: polynomial kernel $K(\mathbf{x},\mathbf{x}')=\left(1+\frac{1}{d}\langle \mathbf{x}, \mathbf{x}'\rangle\right)^3$; Right: NTK corresponding to a $3$-layer fully-connected network (see the experiments appendix). As the input dimension grows, the multiple descent phenomenon becomes more pronounced, and the MSE at the "valleys" decreases. The shaded region denotes $95\%$ confidence over $50$ trials with $2500$ test samples each.
  • Figure 2: Apparently diverging variance in low dimensions, for a GPK corresponding to a $3$-layer fully-connected network (see the experiments appendix) with inputs distributed uniformly on the unit disk $\{x\in\mathbb{R}^2 : \left\lVert x\right\rVert\leq 1\}$ and noise $\epsilon\sim \mathcal{N}(0, 1)$. The solid line denotes the median variance (rather than the mean, because of extreme values), and the shaded region denotes $95\%$ confidence over $100$ trials with $5000$ test samples each. This suggests that previous works that inferred $V\leq \mathcal{O}(1)$ for kernels with polynomially decaying eigenvalues may be overly optimistic.
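
The setup described in the Figure 1 caption (left panel) can be sketched numerically as follows. This is a minimal illustration assuming NumPy, not the authors' code. The function names (`sample_sphere`, `poly_kernel`, `variance_mse`), the seed handling, and the use of a least-squares solve to obtain the unregularized (minimum-norm) solution are assumptions made here for illustration. The NTK panel and the confidence bands are not reproduced.

```python
import numpy as np


def sample_sphere(n, d, rng):
    """Draw n points uniformly on the unit sphere S^{d-1} (normalized Gaussians)."""
    x = rng.standard_normal((n, d))
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def poly_kernel(a, b):
    """Polynomial kernel K(x, x') = (1 + <x, x'> / d)^3 from the Figure 1 caption."""
    d = a.shape[1]
    return (1.0 + (a @ b.T) / d) ** 3


def variance_mse(n, d, n_test=2500, seed=0):
    """Test MSE of unregularized kernel regression fit to pure noise.

    The target is the constant 0 function, so the labels are just noise
    eps_i ~ N(0, 1) and the test MSE equals the variance term.
    """
    rng = np.random.default_rng(seed)
    x_train = sample_sphere(n, d, rng)
    x_test = sample_sphere(n_test, d, rng)
    y = rng.standard_normal(n)  # labels = 0 + noise

    k_train = poly_kernel(x_train, x_train)
    # Assumption: use the minimum-norm least-squares solution, since this
    # finite-rank kernel matrix becomes singular once n exceeds its rank.
    alpha = np.linalg.lstsq(k_train, y, rcond=None)[0]

    preds = poly_kernel(x_test, x_train) @ alpha
    return float(np.mean(preds ** 2))


if __name__ == "__main__":
    # Hypothetical small sweep; the sample sizes are illustrative only.
    for n in (10, 30, 100, 300):
        print(n, variance_mse(n, d=5))
```

Sweeping `n` over a logarithmic grid for several values of `d`, and averaging over seeds, would give curves of the kind the caption describes; the specific grid used in the paper is not stated in this summary.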

Theorems & Definitions (56)

  • Definition 1
  • Definition 2
  • Remark 1
  • Theorem 1
  • Theorem 2
  • Theorem 3
  • Theorem 4
  • Theorem 5
  • Lemma 1
  • proof
  • ...and 46 more