Demystifying Spectral Bias on Real-World Data

Itay Lavie, Zohar Ringel

TL;DR

The paper tackles spectral bias in kernel ridge regression and Gaussian processes by introducing cross-dataset learnability, which uses an auxiliary, symmetry-respecting measure $q$ to bound learnability on real data without solving the intractable eigenproblem on the target distribution. It derives a tight, practical bound that depends on kernel eigenvalues/eigenfunctions under $q$ and the target’s projection onto these eigenfunctions, plus corollaries linking to desired sample complexity and covariate-shift performance. The authors provide theoretical guarantees, a universal bound (and its corollary lower bound on sample complexity), and empirical validation on real datasets (CIFAR-10, Fashion-MNIST, MNIST) as well as illustrative vignettes with linear regression on manifolds and Transformer copying-head tasks. The approach leverages kernel symmetries via representation theory to transfer favorable spectral properties from the idealized measure $q$ to real data, offering a principled way to anticipate spectral bias and sample complexity, with potential extensions to ridgeless regression and broader architectures.

Abstract

Kernel ridge regression (KRR) and Gaussian processes (GPs) are fundamental tools in statistics and machine learning, with recent applications to highly over-parameterized deep neural networks. The ability of these tools to learn a target function is directly related to the eigenvalues of their kernel sampled on the input data distribution. Targets that have support on higher eigenvalues are more learnable. However, solving such eigenvalue problems on real-world data remains a challenge. Here, we consider cross-dataset learnability and show that one may use eigenvalues and eigenfunctions associated with highly idealized data measures to reveal spectral bias on complex datasets and bound learnability on real-world data. This allows us to leverage various symmetries that realistic kernels manifest to unravel their spectral bias.
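
The link between kernel eigenvalues and learnability can be made concrete with the standard equivalent-kernel approximation for KRR/GP regression. The formula below is a textbook sketch and is not the bound derived in this paper; the symbols $\lambda_k$, $\phi_k$, $y_k$, $\sigma^2$ (ridge or noise variance) and $P$ (number of training samples) are notation assumed here for illustration.

```latex
% Eigenproblem of the kernel k on the data measure p
\int dp(x')\, k(x,x')\, \phi_k(x') = \lambda_k\, \phi_k(x),
\qquad
y(x) = \sum_k y_k\, \phi_k(x)

% Equivalent-kernel estimate of the average predictor after P samples
\bar{f}(x) \approx \sum_k \frac{\lambda_k}{\lambda_k + \sigma^2/P}\; y_k\, \phi_k(x)
```

Under this approximation the factor $\lambda_k/(\lambda_k+\sigma^2/P)$ plays the role of a per-mode learnability: it approaches 1 once $P \gg \sigma^2/\lambda_k$, so modes with larger eigenvalues are learned with fewer samples. Evaluating it requires the eigenpairs on $p$, which is the step that is hard on real-world data.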

Paper Structure (19 sections, 3 theorems, 52 equations, 3 figures)

Key Result

Proposition 2.1

Given the expected importance ratios defined in Eq. \ref{eq:importance_ratios}, with ${\rm MSE}= \mathbb{E}_{x \sim p} \left[ \left(f(x)-y(x) \right)^2 \right]$ and $\bar{I},\,\bar{J}$ the expected density ratios. The proof is given in Appendix \ref{appendix:prop_proof}.

Figures (3)

  • Figure 1: (The onset of learnability is tightly bounded in an idealized setting) The cross-dataset learnability (dots) and our bound on the cross-dataset learnability (dashed) of random linear $\phi_1$, quadratic $\phi_2$, and cubic $\phi_4$ target features. The training set consists of $10^4$ samples drawn uniformly on the hypersphere $\mathbb{S}^{7}$, and $q$ is a uniform (continuous) distribution on the hypersphere. The shaded areas indicate a learning region, given by our bound taken at equality for $\epsilon\in[0,0.7]$. The bound is seen to be tight before and around the onset of learning, even for a single realization. Notably, we do not expect the bound to be tight once the feature is already learned well; its purpose is to predict the minimum number of samples required for learning.
  • Figure 2: (Theory predicts spectral bias on real-world datasets) The (test) learnability (dots) together with the bound on the cross-dataset learnability in Eq. \ref{eq:learnability_bound_main} (dashed). The shaded learning region indicates the values of $P$ given by the bound in Eq. \ref{eq:main_result} for $0\leq\epsilon\leq0.7$. In most cases, the dashed bound and shaded learning regions give a good estimate of the sample complexity of the features.
  • Figure 3: (Cross-dataset learnability approximates the learnability) When the auxiliary distribution $q$ is similar to the data distribution, the cross-dataset learnability (dots) approximates the learnability (stars). We use PCA whitening to bring the datasets' (CIFAR-10, Fashion-MNIST, MNIST) distributions closer to the auxiliary distribution $q$ (uniform on the hypersphere $\mathbb{S}^{17}$). The shaded learning regions give a good indication of the sample complexity of the features.
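
The idealized setting of Figure 1 can be imitated with a small numerical experiment. The sketch below is illustrative only and does not reproduce the paper's code or its bound: the kernel $\exp(x\cdot x')$, the ridge value, the sample sizes, and the helper names `sample_sphere`, `krr_predict`, and `empirical_learnability` are all assumptions. As a proxy for learnability it uses the regression coefficient of the KRR predictor on a random linear target, estimated on fresh test points drawn uniformly on $\mathbb{S}^{7}$.

```python
import numpy as np


def sample_sphere(n, d, rng):
    """Draw n points uniformly on the unit sphere in R^d (S^{d-1})."""
    x = rng.standard_normal((n, d))
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def krr_predict(x_train, y_train, x_test, ridge):
    """Kernel ridge regression with the (assumed) dot-product kernel exp(x . x')."""
    k_train = np.exp(x_train @ x_train.T)
    k_test = np.exp(x_test @ x_train.T)
    alpha = np.linalg.solve(k_train + ridge * np.eye(len(x_train)), y_train)
    return k_test @ alpha


def empirical_learnability(p, d=8, ridge=1.0, n_test=2000, seed=0):
    """Projection of the predictor onto a linear target, normalized by the target norm."""
    rng = np.random.default_rng(seed)
    w = sample_sphere(1, d, rng)[0]        # random linear target y(x) = w . x
    x_train = sample_sphere(p, d, rng)     # P training samples on S^{d-1}
    x_test = sample_sphere(n_test, d, rng)
    y_train, y_test = x_train @ w, x_test @ w
    f_test = krr_predict(x_train, y_train, x_test, ridge)
    return float(f_test @ y_test / (y_test @ y_test))


# Learnability proxy as a function of the number of training samples P.
learnabilities = {p: empirical_learnability(p) for p in (10, 100, 1000)}
for p, value in learnabilities.items():
    print(f"P={p:5d}  learnability ~ {value:.3f}")
```

The proxy should rise from well below 1 at small $P$ towards 1 at large $P$, which is the onset of learning that the dashed bounds in the figures aim to locate. Repeating the experiment with quadratic or cubic targets would be expected to shift that onset to larger $P$, in line with spectral bias.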

Theorems & Definitions (5)

  • Proposition 2.1
  • Theorem 3.1
  • Corollary 3.2
  • Definition 3.3: Kernel Symmetry
  • Definition 3.4: Dataset Symmetry