On the Saturation Effects of Spectral Algorithms in Large Dimensions

Weihao Lu, Haobo Zhang, Yicheng Li, Qian Lin

TL;DR

This work analyzes saturation phenomena for spectral algorithms in the large-dimension regime where $n$ scales as $d^{\gamma}$. It proves that kernel gradient flow with early stopping attains the minimax rate up to polylog factors, while kernel ridge regression can be strictly suboptimal when the regression function is sufficiently smooth ($s>1$). By formulating exact convergence rates for a broad class of analytic spectral algorithms with qualification $\tau$, the paper uncovers saturation (for $s>\tau$), periodic plateau behavior (for $0<s\le 2\tau$), and a polynomial-approximation barrier as $s\to 0$, all in the context of inner-product kernels on the sphere. These results unify fixed-d saturation phenomena with large-d phenomena (multiple descent, plateaus, and barriers) and have implications for understanding high-dimensional kernel methods and the lazy regime of neural networks through NTK-like kernels. The findings establish kernel gradient flow as minimax-optimal in large dimensions and delineate precise regimes where KRR cannot achieve the minimax lower bound.

Abstract

The saturation effects, which originally refer to the fact that kernel ridge regression (KRR) fails to achieve the information-theoretical lower bound when the regression function is over-smooth, have been observed for almost 20 years and were rigorously proved recently for kernel ridge regression and some other spectral algorithms over a fixed dimensional domain. The main focus of this paper is to explore the saturation effects for a large class of spectral algorithms (including KRR, gradient descent, etc.) in large dimensional settings where $n \asymp d^{\gamma}$. More precisely, we first propose an improved minimax lower bound for the kernel regression problem in large dimensional settings and show that gradient flow with an early stopping strategy results in an estimator achieving this lower bound (up to a logarithmic factor). Similar to the results for KRR, we can further determine the exact convergence rates (both upper and lower bounds) of a large class of (optimally tuned) spectral algorithms with different qualifications $\tau$. In particular, we find that these exact rate curves (varying along $\gamma$) exhibit the periodic plateau behavior and the polynomial approximation barrier. Consequently, we can fully depict the saturation effects of the spectral algorithms and reveal a new phenomenon in large dimensional settings (i.e., the saturation effect occurs in the large dimensional setting as long as the source condition satisfies $s > \tau$, while it occurs in the fixed dimensional setting as long as $s > 2\tau$).
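The spectral algorithms named in the abstract can be sketched as filter functions applied to the eigenvalues of the normalized kernel matrix. The following is a minimal illustration, not the paper's implementation: the function names, the $K/n$ normalization, and the parametrization of gradient flow by a stopping time `t` are assumptions made here, and the filter formulas are the standard textbook ones for KRR, iterated ridge regression, and gradient flow.

```python
import numpy as np

def krr_filter(z, lam):
    """Kernel ridge regression filter: phi(z) = 1 / (z + lam). Qualification 1."""
    return 1.0 / (z + lam)

def iterated_ridge_filter(z, lam, q):
    """q-times iterated ridge filter: phi(z) = (1 - (lam / (z + lam))**q) / z.
    Qualification q; with q = 1 it reduces to the KRR filter."""
    return (1.0 - (lam / (z + lam)) ** q) / z

def gradient_flow_filter(z, t):
    """Gradient flow stopped at time t: phi(z) = (1 - exp(-t z)) / z.
    Its residual 1 - z * phi(z) = exp(-t z) decays faster than any power of z."""
    return -np.expm1(-t * z) / z

def spectral_fit(K, y, filt):
    """Coefficients alpha such that f_hat(x) = sum_i alpha_i * k(x, x_i).

    K is the n x n kernel matrix; filt maps eigenvalues of K / n to filter values.
    """
    n = len(y)
    evals, evecs = np.linalg.eigh(K / n)
    evals = np.clip(evals, 1e-12, None)  # guard against tiny negative round-off
    return evecs @ (filt(evals) * (evecs.T @ y)) / n
```

With the KRR filter, `spectral_fit` agrees with the familiar closed form $(K + n\lambda I)^{-1} y$, which gives a quick sanity check on the normalization assumed above.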

Paper Structure

This paper contains 39 sections, 39 theorems, 179 equations, 5 figures.

Key Result

Theorem 1.1

Let $s>0$, $\tau \geq 1$, and $\gamma>0$ be fixed real numbers, and let $p$ be the integer satisfying $\gamma \in [p(s+1), (p+1)(s+1))$. Then, under certain conditions, the excess risk of a large-dimensional spectral algorithm with qualification $\tau$ satisfies [...], where $\tilde{s} = \min\{s, 2\tau\}$.
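The two auxiliary quantities in the theorem statement, the integer $p$ and the truncated smoothness $\tilde{s}$, follow directly from their definitions. A small sketch is given below; the helper names `plateau_index` and `effective_smoothness` are hypothetical and chosen here only for illustration.

```python
import math

def plateau_index(gamma, s):
    """Integer p with gamma in [p(s+1), (p+1)(s+1)), i.e. p = floor(gamma / (s+1)).

    Floating-point division may misplace gamma lying exactly on an interval boundary.
    """
    return math.floor(gamma / (s + 1.0))

def effective_smoothness(s, tau):
    """s_tilde = min(s, 2 * tau); tau = math.inf corresponds to no truncation."""
    return min(s, 2.0 * tau)
```

For example, with $s = 1$ and $\gamma = 2.5$ the interval is $[2, 4)$, so $p = 1$; with $s = 5$ and $\tau = 2$ the smoothness is truncated to $\tilde{s} = 4$.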

Figures (5)

  • Figure 1: Convergence rates of spectral algorithm with qualification $\tau=2$ in Theorem \ref{thm:kernel_methods_bounds}, Theorem \ref{thm:kernel_methods_bounds_sat}, and corresponding minimax lower rates in Theorem \ref{thm:modified_minimax_lower_bound} with respect to dimension $d$. We present four graphs corresponding to four kinds of source conditions: $s = 0.01, 1, 3, 5$. The x-axis represents asymptotic scaling, $\gamma: n \asymp d^{\gamma}$; the y-axis represents the convergence rate of excess risk, $r: \text{Excess risk} \asymp d^{r}$.
  • Figure 2: Convergence rates of spectral algorithms with qualification $\tau=1$ (KRR), $\tau=2$ (iterated ridge regression), $\tau=4$ (iterated ridge regression), and $\tau=\infty$ (kernel gradient flow) in Theorem \ref{thm:kernel_methods_bounds}, Theorem \ref{thm:kernel_methods_bounds_sat}, and corresponding minimax lower rates in Theorem \ref{thm:modified_minimax_lower_bound} with respect to dimension $d$. We present four graphs corresponding to four kinds of source conditions: $s = 0.01, 1, 3, 5$. The x-axis represents asymptotic scaling, $\gamma: n \asymp d^{\gamma}$; the y-axis represents the convergence rate of excess risk, $r: \text{Excess risk} \asymp d^{r}$.
  • Figure 3: Results of Experiment 1. We repeated each experiment 50 times and reported the average excess risk for (a) kernel gradient flow (labeled as "kernel regression" in our reports) and (b) kernel ridge regression (KRR) on 1000 test samples. We randomly selected $u_{1}, u_{2}, u_{3}$ and kept them fixed for each repeat. We choose the stopping time $t$ in kernel gradient flow as $C_{1} n^{0.5}$, where $C_{1} \in \{0.001, 0.01, 0.1, 1, 10, 100, 1000\}$. We use 5-fold cross-validation to select the regularization parameter $\lambda$ in kernel ridge regression. The alternative values of $\lambda$ in cross-validation are $C_{2} n^{-C_{3}}$, where $C_{2} \in \{0.001, 0.005, 0.01, 0.1, 0.5, 1, 2, 5, 10, 40, 100, 300, 1000\}, C_{3} \in \{ 0.1, 0.2, \ldots, 1.5\}$.
  • Figure 4: A plot similar to Figure \ref{fig:3_1}, but with the RBF kernel.
  • Figure 5: Results of Experiment 2. It can be seen that the best rate of excess risk for KRR is slower than that of kernel gradient flow.
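The tuning grids described in the Figure 3 caption can be written out explicitly. The sketch below builds the candidate stopping times $C_1 n^{0.5}$ and regularization parameters $C_2 n^{-C_3}$, and selects $\lambda$ by 5-fold cross-validation on a precomputed kernel matrix. The function names, the 0.1 step assumed for $C_3 \in \{0.1, 0.2, \ldots, 1.5\}$, the $K + n\lambda I$ normalization, and the random fold assignment are assumptions made for illustration and are not taken from the paper's code.

```python
import numpy as np

def stopping_time_grid(n):
    """Candidate stopping times t = C1 * n**0.5 for kernel gradient flow."""
    c1 = np.array([0.001, 0.01, 0.1, 1, 10, 100, 1000])
    return c1 * float(n) ** 0.5

def ridge_lambda_grid(n):
    """Candidate lambdas C2 * n**(-C3), flattened over all (C2, C3) pairs."""
    c2 = np.array([0.001, 0.005, 0.01, 0.1, 0.5, 1, 2, 5, 10, 40, 100, 300, 1000])
    c3 = np.arange(1, 16) / 10.0  # assumed step of 0.1: 0.1, 0.2, ..., 1.5
    return (c2[:, None] * float(n) ** (-c3[None, :])).ravel()

def select_lambda_cv(K, y, lambdas, n_folds=5, seed=0):
    """Pick the lambda with the lowest summed validation MSE over the folds."""
    n = len(y)
    folds = np.array_split(np.random.default_rng(seed).permutation(n), n_folds)
    errors = np.zeros(len(lambdas))
    for val in folds:
        tr = np.setdiff1d(np.arange(n), val)
        K_tr, K_val = K[np.ix_(tr, tr)], K[np.ix_(val, tr)]
        for j, lam in enumerate(lambdas):
            # KRR on the training fold, with the assumed n * lambda scaling
            alpha = np.linalg.solve(K_tr + len(tr) * lam * np.eye(len(tr)), y[tr])
            errors[j] += np.mean((K_val @ alpha - y[val]) ** 2)
    return lambdas[int(np.argmin(errors))]
```

Under these assumptions the grids contain 7 stopping times and $13 \times 15 = 195$ candidate values of $\lambda$ for each sample size $n$.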

Theorems & Definitions (72)

  • Theorem 1.1: Restate Theorem 4.1 and 4.2, non-rigorous
  • Remark 2.1
  • Proposition 2.2
  • Proposition 2.3: Lower bound on the minimax rate
  • Theorem 3.1: Kernel gradient flow
  • Remark 3.2
  • Theorem 3.3: Improved minimax lower bound
  • Example 1: Kernel ridge regression
  • Example 2: Kernel gradient flow
  • Theorem 4.1
  • ...and 62 more