A Comprehensive Analysis on the Learning Curve in Kernel Ridge Regression

Tin Sum Cheng, Aurelien Lucchi, Anastasis Kratsios, David Belius

TL;DR

The paper demonstrates the validity of the Gaussian Equivalent Property (GEP), which states that the generalization performance of KRR remains the same when the whitened features are replaced by standard Gaussian vectors, thereby shedding light on the success of previous analyses under the Gaussian Design Assumption.

Abstract

This paper conducts a comprehensive study of the learning curves of kernel ridge regression (KRR) under minimal assumptions. Our contributions are three-fold: 1) we analyze the role of key properties of the kernel, such as its spectral eigen-decay, the characteristics of the eigenfunctions, and the smoothness of the kernel; 2) we demonstrate the validity of the Gaussian Equivalent Property (GEP), which states that the generalization performance of KRR remains the same when the whitened features are replaced by standard Gaussian vectors, thereby shedding light on the success of previous analyses under the Gaussian Design Assumption; 3) we derive novel bounds that improve over existing bounds across a broad range of settings, such as (in)dependent feature vectors and various combinations of eigen-decay rates in the over/underparameterized regimes.
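The GEP statement above can be probed numerically. The following is a minimal sketch, not the authors' experiment: it assumes a truncated linear feature model with features $\psi = \Sigma^{1/2} z$, polynomially decaying eigenvalues $\lambda_k = k^{-1-a}$, target coefficients $\theta^*_k = k^{-r}$, noiseless labels, and the ridge normalization $K + n\lambda I$. The Rademacher features stand in for a generic non-Gaussian whitened feature vector; the function name `bias_term` and all parameter values are hypothetical choices made for illustration.

```python
import numpy as np

def bias_term(Z, eigvals, theta_star, ridge):
    """Noiseless excess risk of ridge regression on features Z * sqrt(eigvals).

    Z          : (n, p) whitened features (identity covariance, assumed)
    eigvals    : (p,) eigenvalues lambda_k of the feature covariance
    theta_star : (p,) target coefficients
    ridge      : ridge parameter lambda (normalization K + n*lambda*I assumed)
    """
    n = Z.shape[0]
    Psi = Z * np.sqrt(eigvals)            # features with covariance diag(eigvals)
    y = Psi @ theta_star                  # noiseless labels
    K = Psi @ Psi.T                       # (n, n) Gram matrix
    alpha = np.linalg.solve(K + n * ridge * np.eye(n), y)
    theta_hat = Psi.T @ alpha             # ridge estimator in feature space
    err = theta_hat - theta_star
    return float(np.sum(eigvals * err**2))  # squared error weighted by the covariance

rng = np.random.default_rng(0)
n, p, a, r = 200, 2000, 1.0, 1.0          # illustrative values only
k = np.arange(1, p + 1, dtype=float)
eigvals = k ** (-1.0 - a)
theta_star = k ** (-r)
ridge = 1e-3

Z_gauss = rng.standard_normal((n, p))          # Gaussian design
Z_rad = rng.choice([-1.0, 1.0], size=(n, p))   # non-Gaussian whitened features

bias_gauss = bias_term(Z_gauss, eigvals, theta_star, ridge)
bias_rad = bias_term(Z_rad, eigvals, theta_star, ridge)
null_risk = float(np.sum(eigvals * theta_star**2))  # risk of predicting zero

print(f"bias (Gaussian)   = {bias_gauss:.3e}")
print(f"bias (Rademacher) = {bias_rad:.3e}")
```

If the GEP holds in this toy setting, the two printed values should be of the same order; repeating the run over several sample sizes would give the corresponding learning curves.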

Paper Structure

This paper contains 66 sections, 46 theorems, 225 equations, 8 figures, 6 tables.

Key Result

Theorem 3.1

Suppose Assumption (GF) holds. Assume the eigen-spectrum and the target coefficients both have polynomial decay, that is, $\lambda_k=\Theta_{k}\left(k^{-1-a}\right)$ and $|\theta^*_k|=\Theta_{k}\left(k^{-r}\right)$. Let $s=\frac{2r+a}{1+a}$ be the source coefficient defined in Assumption (SC), let $n$ be the sample size, and let $(\cdot)_+ \stackrel{\text{def.}}{=} \max(\cdot,0)$.
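As a small worked example of the source coefficient $s=\frac{2r+a}{1+a}$, the sketch below evaluates it exactly with rational arithmetic. The function name `source_coefficient` and the sample values of $a$ and $r$ are hypothetical and chosen only for illustration; the requirement $a, r > 0$ is an assumption of this sketch.

```python
from fractions import Fraction

def source_coefficient(a, r):
    """Return s = (2r + a) / (1 + a) for lambda_k ~ k^(-1-a), |theta*_k| ~ k^(-r)."""
    a, r = Fraction(a), Fraction(r)
    if a <= 0 or r <= 0:
        raise ValueError("this sketch assumes a > 0 and r > 0")
    return (2 * r + a) / (1 + a)

# Illustrative values only.
s_example_1 = source_coefficient(1, 1)                             # (2 + 1) / 2
s_example_2 = source_coefficient(Fraction(1, 2), Fraction(3, 2))   # (3 + 1/2) / (3/2)

print(s_example_1, float(s_example_1))
print(s_example_2, float(s_example_2))
```

A faster target decay (larger $r$) gives a larger $s$, that is, a smoother target relative to the kernel.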

Figures (8)

  • Figure 1: Phase diagram of the bound (Equation (\ref{line:bias:novel_bound})) on the bias term $\mathcal{B}$ under weak ridge and polynomial eigen-decay, with $\lambda_k=\Theta_{k}\left(k^{-1-a}\right)$ and $|\theta^*_k|=\Theta_{k}\left(k^{-r}\right)$ for some $a,r>0$. Our result (Propositions \ref{proposition:bias:ub:asymptotic:ridgeless}, \ref{proposition:bias:ub:asymptotic:ridgeless:0}, and \ref{proposition:bias:lb:asymptotic}) is on the left; it improves over the previous result of \cite{barzilai2023generalization} (Proposition \ref{proposition:bias:ub:asymptotic:ridgeless:0}), shown on the right. In the left plot, the range of the source coefficient $s=\frac{2r+a}{1+a}$ from Assumption (SC) is shown in gray font in each colored region.
  • Figure 2: Variance $\mathcal{V}$ against sample size $n$ for the Laplacian kernel (left) and the neural tangent kernel with one hidden layer (right), both defined on the unit 2-disk. The plots validate Equation (\ref{line:variance:overfitting}): the variance with generic features (GF) can be as good as with independent features (IF) ($\mathcal{V}=\Theta_{n}\left(1\right)$) or qualitatively different ($\mathcal{V}\xrightarrow{n\to\infty}\infty$). See Section \ref{section:experiment} for more details.
  • Figure 3: Decay of the bias term $\mathcal{B}$ and the variance term $\mathcal{V}$ under different ridge decays and target coefficient decays. All features demonstrate the same theoretical decay, validating the GEP for independent features.
  • Figure 4: Flowchart of the proof techniques used in this paper.
  • Figure 5: Decay of the bias term $\mathcal{B}$ under strong ridge $\lambda=\lambda_n=\Theta\left(n^{-1-a}\right)$, with $\lambda_k=(\frac{2k-1}{2}\pi)^{-1-a}$ and $\theta^*_k = (\frac{2k-1}{2}\pi)^{-r}$. The theoretical decay is $\mathcal{B}=\mathcal{O}\left(n^{-(1+a)\tilde{s}}\right)$, where $\tilde{s}=\min\{s,2\}$ and $s=\frac{2r+a}{1+a}$ is the source coefficient. (Left): $s=1.5$ and $\mathcal{B}=\mathcal{O}\left(n^{-(1+1)\min\{1.5,2\}}\right)=\mathcal{O}\left(n^{-3}\right)$; (right): $s=2.33>2$ and $\mathcal{B}=\mathcal{O}\left(n^{-(1+0.5)\min\{2.33,2\}}\right)=\mathcal{O}\left(n^{-3}\right)$, showing the saturation effect mentioned in \cite{li2022saturation}. All features demonstrate the same theoretical decay, validating the GEP.
  • ...and 3 more figures
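The exponent arithmetic in the Figure 5 caption can be checked with a short sketch. The helper `bias_decay_exponent` implements the caption's formula $(1+a)\min\{s,2\}$; `fitted_exponent` is a hypothetical helper showing how a decay rate could be read off a log-log plot, and the curve it is applied to is synthetic, not data from the paper.

```python
import numpy as np

def bias_decay_exponent(a, s, saturation=2.0):
    """Exponent e in B = O(n^{-e}) with e = (1 + a) * min(s, saturation)."""
    return (1.0 + a) * min(s, saturation)

def fitted_exponent(ns, values):
    """Estimate e from values ~ C * n^{-e} by least squares in log-log scale."""
    slope, _intercept = np.polyfit(np.log(ns), np.log(values), 1)
    return -slope

# The two panels described in the caption of Figure 5.
left = bias_decay_exponent(a=1.0, s=1.5)     # s below the saturation level 2
right = bias_decay_exponent(a=0.5, s=2.33)   # s above 2, so min(s, 2) = 2

# Synthetic power-law curve (not from the paper) to exercise the fit.
ns = np.array([100.0, 200.0, 400.0, 800.0, 1600.0])
synthetic_bias = 5.0 * ns ** (-left)
estimated = fitted_exponent(ns, synthetic_bias)

print(left, right, estimated)
```

Both panels give the same exponent even though their source coefficients differ, which is the saturation effect: increasing $s$ beyond 2 does not improve the rate.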

Theorems & Definitions (104)

  • Definition 2.1: Bias-variance decomposition
  • Remark 2.2: Noiseless labels
  • Definition 2.3: Interpolation space
  • Theorem 3.1: Improved upper bound
  • Definition 4.1: Concentration coefficients \cite{barzilai2023generalization}
  • Definition 4.2: Effective ranks \cite{bartlett2020benign}
  • Remark A.1: $k=0$
  • Remark A.2: Intuition
  • Remark A.3: Weaken Assumption (GF)
  • Remark A.4: Infinite rank
  • ...and 94 more