A non-asymptotic theory of Kernel Ridge Regression: deterministic equivalents, test error, and GCV estimator

Theodor Misiakiewicz; Basil Saeed

TL;DR

This paper establishes a non-asymptotic deterministic approximation for the test error of kernel ridge regression (KRR), with explicit non-asymptotic bounds, that depends only on the eigenvalues of the kernel and on the alignment of the target function with its eigenvectors.

Abstract

We consider learning an unknown target function $f_*$ using kernel ridge regression (KRR) given i.i.d. data $(u_i,y_i)$, $i\leq n$, where $u_i \in \mathcal{U}$ is a covariate vector and $y_i = f_* (u_i) +\varepsilon_i \in \mathbb{R}$. A recent string of work has empirically shown that the test error of KRR can be well approximated by a closed-form estimate derived from an 'equivalent' sequence model that only depends on the spectrum of the kernel operator. However, a theoretical justification for this equivalence has so far relied either on restrictive assumptions (such as subgaussian independent eigenfunctions) or on asymptotic derivations for specific kernels in high dimensions. In this paper, we prove that this equivalence holds for a general class of problems satisfying some spectral and concentration properties on the kernel eigendecomposition. Specifically, we establish in this setting a non-asymptotic deterministic approximation for the test error of KRR, with explicit non-asymptotic bounds, that only depends on the eigenvalues and the target function alignment to the eigenvectors of the kernel. Our proofs rely on a careful derivation of deterministic equivalents for random matrix functionals in the dimension-free regime pioneered by Cheng and Montanari (2022). We apply this setting to several classical examples and show an excellent agreement between theoretical predictions and numerical simulations. These results rely on having access to the eigendecomposition of the kernel operator. Alternatively, we prove that, under this same setting, the generalized cross-validation (GCV) estimator concentrates on the test error uniformly over a range of ridge regularization parameters that includes zero (the interpolating solution). As a consequence, the GCV estimator can be used to estimate from data the test error and optimal regularization parameter for KRR.
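To make the last point of the abstract concrete, the following is a minimal sketch of a GCV score for KRR. It uses the textbook GCV formula for a linear smoother $S_\lambda = K (K + \lambda I)^{-1}$; the paper's own normalization of $\lambda$ is not shown on this page and may differ. The RBF kernel, the bandwidth, the synthetic data, and the names `rbf_kernel` and `gcv_score` are all illustrative assumptions.

```python
import numpy as np

def rbf_kernel(A, B, bandwidth=1.0):
    # Pairwise squared Euclidean distances between rows of A and rows of B.
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * bandwidth ** 2))

def gcv_score(K, y, lam):
    """Textbook GCV: (1/n) ||(I - S) y||^2 / ((1/n) tr(I - S))^2.

    With S = K (K + lam I)^{-1} we have I - S = lam (K + lam I)^{-1}, so the
    factor lam cancels and the score equals n ||A y||^2 / tr(A)^2 with
    A = (K + lam I)^{-1}. This form stays defined at lam = 0 if K is invertible.
    """
    n = K.shape[0]
    A = np.linalg.inv(K + lam * np.eye(n))
    Ay = A @ y
    return n * (Ay @ Ay) / np.trace(A) ** 2

# Hypothetical synthetic data, chosen only to exercise the function.
rng = np.random.default_rng(0)
n, d = 200, 5
U = rng.standard_normal((n, d))
y = np.sin(U[:, 0]) + 0.3 * rng.standard_normal(n)
K = rbf_kernel(U, U, bandwidth=2.0)

# Scan a grid of regularization values and keep the GCV minimizer.
lams = np.logspace(-3, 3, 13)
scores = np.array([gcv_score(K, y, lam) for lam in lams])
best_lam = lams[np.argmin(scores)]
```

The score is computed from the training data alone, which is what makes it usable when the eigendecomposition of the kernel operator is unavailable.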

Paper Structure (57 sections, 49 theorems, 740 equations, 4 figures)


Introduction
Background
A deterministic equivalent for the test error
Summary of main results
Related literature
Test error of kernel ridge regression
Setting and definitions
Assumptions
A master theorem
Kernel operator and sufficient conditions
Two classical examples
The case of concentrated features
Inner-product kernels on the sphere
Uniform consistency of the GCV estimator
Outline of the proofs
...and 42 more sections

Key Result

Theorem 1

Consider $D,K>0$, integer $n$, regularization parameter $\lambda \geq 0$, and target function $f_* \in L^2 (\mathcal{U})$ with parameters $\| {\boldsymbol \beta}_*\|_2 = \| f_* \|_{L^2} < \infty$. Assume that the features $\{{\boldsymbol x}_i\}_{i\in[n]}$ and $f_*$ satisfy Assumption \ref{ass:main_assump}. Then, with probability at least $1 - n^{-D} - p_{2,n} ({\mathsf m})$, we have where the relative ap

Figures (4)

Figure 1: Test error of KRR plotted against the training sample size $n \in \{2, \ldots, 20000\}$. We consider data $({\boldsymbol u}_i,y_i)$ from MNIST where ${\boldsymbol u}_i$ is a $d=28^2$ dimensional image and $y_i \in \{0, \ldots, 9\}$ is its label. We fit this data using KRR \ref{eq:KRR_problem_RKHS} with three standard kernels $K_j ({\boldsymbol u},{\boldsymbol u}')$; one simulation uses the ReLU NTK of depth 5 and the other two correspond to the RBF kernel with the bandwidths specified in the figure. We take $\lambda =0$ (interpolating solution). The continuous lines correspond to the theoretical predictions from the deterministic equivalent \ref{eq:DetEquiv_Risk_Intro}, where the eigenvalues of ${\boldsymbol \Sigma}$ and ${\boldsymbol \beta}^*$ are estimated from a sample of the data of size $25000$. For the empirical test errors (markers), we solve KRR on $n$ images sampled uniformly from the training set, and report the average and the standard deviation of the test error over $50$ independent realizations.
Figure 2: Test error of KRR plotted against the training sample size $n \in \{2, \ldots, 20000\}$. We consider data $({\boldsymbol u}_i,y_i)$ with ${\boldsymbol u}_i \sim_{iid} {\rm Unif} (\mathbb{S}^{d-1} (\sqrt{d}))$, $d=24$, and $y_i = f_* ({\boldsymbol u}_i) + \varepsilon_i$ with independent label noise $\varepsilon_i \sim {\sf N} (0, \sigma_\varepsilon^2)$, $\sigma_\varepsilon^2 = 0.1$. We fit this data using KRR \ref{eq:KRR_problem_RKHS} with three different inner-product kernels $K_j ({\boldsymbol u},{\boldsymbol u}') = h_j (\langle {\boldsymbol u} , {\boldsymbol u}'\rangle/d)$, $j \in [3]$, with spectral gaps in $\{8,32,128\}$, and regularization parameter $\lambda =0$. The continuous lines correspond to the theoretical predictions from the deterministic equivalent \ref{eq:DetEquiv_Risk_Intro}. For the empirical test errors (markers), we report the average and the standard deviation of the test error over $50$ independent realizations. See Section \ref{sec:inner-product_main} for details.
Figure 3: Test error of ridge regression with features ${\boldsymbol x} = \sigma ({\boldsymbol W} {\boldsymbol u}) \in \mathbb{R}^p$ and covariates ${\boldsymbol u} \sim {\sf N} (0,{\mathbf I}_d)$, where the weight matrix ${\boldsymbol W} = [{\boldsymbol w}_1 , \ldots , {\boldsymbol w}_p]^{\mathsf T} \in \mathbb{R}^{p \times d}$ is fixed and the activation $\sigma$ is chosen from $\{ \text{ReLU}, \text{sigmoid}, \text{tanh}\}$. We take ${\boldsymbol w}_j \sim_{i.i.d.} {\sf N}(0,{\mathbf I}_d/d)$ and target function $f_* ({\boldsymbol u}) = \frac{1}{\sqrt{2}}\langle {\boldsymbol e}, {\boldsymbol u} \rangle + \frac{1}{2} \left( \langle {\boldsymbol e} , {\boldsymbol u} \rangle^2 - 1 \right)$ with ${\boldsymbol e} \in \mathbb{R}^d$, $\| {\boldsymbol e} \|_2 = 1$ chosen arbitrarily. Here we set $d = 60$, $p = 120$, $\lambda = 0.1$, and $\sigma_\varepsilon^2 =0$. For the empirical test errors (markers), we report the average and the standard deviation of the test error over $50$ independent realizations.
Figure 4: Predictions of the GCV estimator compared to its deterministic equivalent plotted against the regularization parameter $\lambda \in [10^{-3}, 10^3]$ and sample size $n \in \{300,1000,5000\}$. The setting for the synthetic data (left) is the same as Figure \ref{fig:sphere_test} with gap equal to $128$, while the setting for the real data (right) is the same as in Figure \ref{fig:real_test} with the NTK kernel of depth 5. The continuous lines correspond to the theoretical predictions as computed by the deterministic equivalents, while each of the 20 dashed lines corresponds to the GCV estimator computed from a sample of size $n$.
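
The empirical side of the Figure 3 experiment can be sketched as follows, using the values stated in the caption ($d = 60$, $p = 120$, $\lambda = 0.1$, noiseless labels, ReLU activation). Several details are assumptions made for illustration and are not confirmed by this page: the ridge penalty is taken as $\lambda \|\theta\|_2^2$ without any factor of $n$, the test error is estimated by Monte Carlo on fresh covariates, the sample size $n = 4000$ and the number of repetitions are arbitrary, and the function names are hypothetical.

```python
import numpy as np

def target(U, e):
    # f_*(u) = <e, u> / sqrt(2) + (<e, u>^2 - 1) / 2, as in the Figure 3 caption.
    z = U @ e
    return z / np.sqrt(2.0) + 0.5 * (z ** 2 - 1.0)

def ridge_test_error(W, e, n, lam=0.1, n_test=5000, rng=None):
    """Fit ridge regression on ReLU features x = relu(W u); return test MSE."""
    rng = np.random.default_rng() if rng is None else rng
    p, d = W.shape
    U_train = rng.standard_normal((n, d))       # u ~ N(0, I_d)
    U_test = rng.standard_normal((n_test, d))
    X_train = np.maximum(U_train @ W.T, 0.0)    # ReLU features
    X_test = np.maximum(U_test @ W.T, 0.0)
    y_train = target(U_train, e)                # noiseless labels
    # Assumed normalization: minimize ||y - X theta||^2 + lam * ||theta||^2.
    theta = np.linalg.solve(X_train.T @ X_train + lam * np.eye(p),
                            X_train.T @ y_train)
    return float(np.mean((target(U_test, e) - X_test @ theta) ** 2))

rng = np.random.default_rng(0)
d, p = 60, 120
W = rng.standard_normal((p, d)) / np.sqrt(d)    # rows w_j ~ N(0, I_d / d), then held fixed
e = np.zeros(d)
e[0] = 1.0                                      # any unit vector works

# Average over a few independent draws of the data (the caption uses 50).
errors = [ridge_test_error(W, e, n=4000, rng=rng) for _ in range(5)]
mean_err, std_err = np.mean(errors), np.std(errors)
```

Since the target has unit second moment under ${\boldsymbol u} \sim {\sf N} (0,{\mathbf I}_d)$, a mean error below 1 indicates that the fit beats the zero predictor.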

Theorems & Definitions (83)

Definition 1: Effective regularization
Definition 2: Ranks
Theorem 1: Deterministic equivalent for the KRR test error
Proposition 1
Remark 2.1: The case ${\mathcal{D}} \subsetneq L^2 (\mathcal{U})$.
Corollary 1: Test error with concentrated features
Corollary 2
Remark 3.1: Local and average laws
Theorem 2: Test error for inner-product kernels on the sphere
Theorem 3: Uniform consistency of the GCV estimator
...and 73 more
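
Definition 1 and Theorem 1 are listed here by title only, so the sketch below rests on an assumption: it uses the form of the effective regularization and of the bias-variance deterministic equivalent that commonly appears in this line of work (for example, around Cheng and Montanari (2022)). There, $\lambda_*$ solves $n - \lambda/\lambda_* = {\rm Tr}({\boldsymbol \Sigma}({\boldsymbol \Sigma} + \lambda_*)^{-1})$, and the predicted test error is a bias term plus a variance term. The paper's exact statement and normalization may differ, and the spectrum, target coefficients, and function names are hypothetical.

```python
import numpy as np

def effective_regularization(eigs, n, lam, iters=200):
    """Solve n - lam / t = sum_k eigs_k / (eigs_k + t) for t > 0 by bisection.

    Assumes lam > 0, or lam == 0 with more than n nonzero eigenvalues.
    """
    def g(t):
        # g is increasing in t, so the positive root is unique.
        return n - lam / t - np.sum(eigs / (eigs + t))

    lo, hi = 1e-15, 1.0
    while g(hi) < 0:
        hi *= 2.0
    for _ in range(iters):
        mid = np.sqrt(lo * hi)  # bisect on a log scale
        if g(mid) < 0:
            lo = mid
        else:
            hi = mid
    return np.sqrt(lo * hi)

def deterministic_risk(eigs, beta, n, lam, noise_var):
    """Assumed form: bias + variance, both driven by the effective regularization."""
    t = effective_regularization(eigs, n, lam)
    df2 = np.sum(eigs ** 2 / (eigs + t) ** 2)   # Tr(Sigma^2 (Sigma + t)^{-2})
    bias = t ** 2 * np.sum(beta ** 2 / (eigs + t) ** 2) / (1.0 - df2 / n)
    variance = noise_var * df2 / (n - df2)
    return bias + variance

# Hypothetical power-law spectrum and target coefficients.
k = np.arange(1, 501)
eigs = k ** -2.0
beta = k ** -1.0
risk = deterministic_risk(eigs, beta, n=100, lam=1e-2, noise_var=0.1)
```

Only the eigenvalues `eigs` and the coefficients `beta` of the target in the eigenbasis enter the computation, which matches the description of the deterministic approximation given in the abstract.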
