Generalization error of spectral algorithms

Maksim Velikanov, Maxim Panov, Dmitry Yarotsky

TL;DR

This work develops a unified spectral framework to quantify the generalization error of kernel methods trained by a broad family of spectral algorithms parameterized by a learning profile $h(\lambda)$. By expressing the generalization error as a quadratic functional of $h$, the authors derive explicit loss functionals for Circle and Wishart data models and introduce NMNO as a simple, universal-noise surrogate. Under power-law spectral assumptions, they obtain full loss asymptotics, reveal spectral localization of the error, and identify when overlearning or saturation occurs, including a notable overlearning transition at $\kappa=\nu-1$ in the noiseless regime. The results show universality of the loss with respect to non-spectral problem details under noisy observations and provide insights into optimal algorithm design beyond classical KRR, GF, and interpolation. Overall, the framework offers precise, model-dependent predictions for generalization in kernel-based learning and highlights how spectral scales control learning dynamics with practical implications for kernel methods and neural-kernel correspondences.

Abstract

The asymptotically precise estimation of the generalization of kernel methods has recently received attention due to the parallels between neural networks and their associated kernels. However, prior works derive such estimates for training by kernel ridge regression (KRR), whereas neural networks are typically trained with gradient descent (GD). In the present work, we consider the training of kernels with a family of spectral algorithms specified by a profile $h(\lambda)$, which includes KRR and GD as special cases. We then derive the generalization error as a functional of the learning profile $h(\lambda)$ for two data models: a high-dimensional Gaussian model and a low-dimensional translation-invariant model. Under power-law assumptions on the spectrum of the kernel and target, we use our framework to (i) give full loss asymptotics for both noisy and noiseless observations; (ii) show that the loss localizes on certain spectral scales, giving a new perspective on the KRR saturation phenomenon; and (iii) conjecture, and demonstrate for the considered data models, the universality of the loss w.r.t. non-spectral details of the problem, but only in the case of noisy observations.
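
To make the notion of a learning profile concrete, the sketch below evaluates $h(\lambda)$ for three familiar algorithms on a power-law spectrum. It assumes the textbook spectral-filter forms: $\lambda/(\lambda+\eta)$ for KRR with ridge $\eta$, $1-e^{-t\lambda}$ for gradient flow run for time $t$, and $h\equiv 1$ for interpolation. The paper's own normalization of $\eta$ and $t$ may differ, and the function names and the spectrum $\lambda_l=l^{-\nu}$ with $\nu=1.5$ are illustrative choices, not taken from the source.

```python
import numpy as np

def h_krr(lam, eta):
    """Assumed KRR profile: ridge eta shrinks each eigen-direction by lam / (lam + eta)."""
    return lam / (lam + eta)

def h_gf(lam, t):
    """Assumed gradient-flow profile at time t: 1 - exp(-t * lam)."""
    return -np.expm1(-t * lam)  # expm1 keeps precision when t * lam is tiny

def h_interp(lam):
    """Interpolation: every eigen-direction is fitted exactly, h = 1."""
    return np.ones_like(lam)

# Hypothetical power-law spectrum lambda_l = l^(-nu) with nu = 1.5.
nu = 1.5
lam = np.arange(1, 6, dtype=float) ** (-nu)

for name, h in [("KRR", h_krr(lam, eta=0.1)),
                ("GF", h_gf(lam, t=10.0)),
                ("interpolation", h_interp(lam))]:
    print(f"{name:>13}: {np.round(h, 3)}")
```

Both the KRR and gradient-flow profiles stay close to 1 on large eigenvalues and decay toward 0 on small ones. In this picture, $1/\eta$ and $t$ play a similar role in setting the spectral scale at which learning stops.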

Paper Structure (71 sections, 12 theorems, 231 equations, 3 figures, 1 table)

Key Result

Proposition 1

There exist signed measures $\rho^{(2)}(d\lambda_1, d\lambda_2)$, $\rho^{(1)}(d\lambda)$ and $\rho^{(\varepsilon)}(d\lambda)$ (given in equations \ref{eq:learning_measure_first_mom}-\ref{eq:noise_variance_learning_measure}) such that the map $h\mapsto\widehat{f}\mapsto L_{\widehat{f}}$ given by \ref{eq:kernel_method}

Figures (3)

  • Figure 1: In the presence of observation noise, the generalization error of different data models converges to our NMNO model (solid) as $N\to\infty$, which in turn converges to its $O(N^{-\#})$ asymptotic (dashed). All plots have $\nu=1.5$. Cosine Wishart is an additional data model that is not covered by our theory yet still converges to NMNO. The difference between the Circle and Wishart asymptotics in the third plot is due to localization of the error on scale $s=0$ at saturation. For details and extended discussion see Sec. \ref{sec:experiments}.
  • Figure 2: Scale diagrams of different KRR regimes for noisy observations. All plots have $\nu=1.2$, while $\kappa=1.0$ in the non-saturated case (left and center) and $\kappa=5.0$ in the saturated case (right). The dotted lines represent the noise $1-\tfrac{s}{\nu}$ and signal $\tfrac{\kappa}{\nu}s$ terms in equation \ref{eq:nmnoscale}. The solid lines show the same terms with added components $2S^{(h)}$ and $2S^{(1-h)}$. Left: the sub-optimal ($s_h>s_*$) non-saturated case. Center: the optimal ($s_h=s_*$) non-saturated case. Right: the saturated ($\kappa>2\nu$) case with the choice $s_\eta=\tfrac{\nu}{2\nu+1}$, which is optimal for KRR but sub-optimal among general algorithms $h$.
  • Figure 3: Generalization error (left) and profiles $h(\lambda)$ (right) of various algorithms applied to the Circle model with $\nu=1.5$ and noiseless observations with different $\kappa$. Before the overlearning transition at $\kappa=\nu-1$, optimal algorithms underlearn the observations ($h(\lambda)<1$); after the transition, they start to overlearn them ($h(\lambda)>1$). For details and extended discussion see Section \ref{sec:experiments}.
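
As a rough worked example for the Figure 2 caption, assume that the optimal scale $s_*$ is the point where the dotted noise and signal lines cross. This reading is an assumption based on the caption alone, not a statement taken from equation \ref{eq:nmnoscale}. Balancing the two terms gives

$$1-\frac{s_*}{\nu}=\frac{\kappa}{\nu}\,s_* \quad\Longrightarrow\quad s_*=\frac{\nu}{\kappa+1}, \qquad \frac{\kappa}{\nu}\,s_*=\frac{\kappa}{\kappa+1}.$$

For the non-saturated panels ($\nu=1.2$, $\kappa=1.0$) this yields $s_*=0.6$, with both terms equal to $0.5$. At the saturation boundary $\kappa=2\nu$ the same formula gives $s_*=\tfrac{\nu}{2\nu+1}$, which coincides with the value $s_\eta$ quoted for the saturated panel.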

Theorems & Definitions (15)

  • Proposition 1
  • Theorem 1
  • Proposition 2: see proof in Section \ref{sec:scaling}
  • Proposition 3
  • Proposition 4
  • Theorem 2
  • Lemma 1
  • proof
  • Proposition 5
  • Theorem 3
  • ...and 5 more