Robust, randomized preconditioning for kernel ridge regression

Mateo Díaz; Ethan N. Epperly; Zachary Frangella; Joel A. Tropp; Robert J. Webber

TL;DR

This work addresses scalable kernel ridge regression (KRR) by introducing two randomized preconditioners for the conjugate gradient (CG) method: RPCholesky for full-data KRR and KRILL for restricted KRR. RPCholesky builds a low-rank, Nyström-like approximation $\boldsymbol{\widehat{A}}$ of the kernel matrix from randomly chosen pivots and forms the preconditioner $\boldsymbol{P} = \boldsymbol{\widehat{A}} + \mu \mathbf{I}$. Under favorable eigenvalue decay, this gives $O(N^2)$ total cost and a near-constant number of CG iterations, with rigorous guarantees stated in terms of the $\mu$-tail rank. KRILL uses a sparse random sign embedding to sketch the Gram matrix of centers, constructing $\boldsymbol{P} = \boldsymbol{B}^* \boldsymbol{B} + \mu \boldsymbol{A}(\mathsf{S}, \mathsf{S})$. It delivers robust convergence for any $\mu$ and any kernel under the stated embedding conditions, at a cost of $O((N + k^2) k \log k)$ operations. Both methods perform strongly across diverse datasets (including quantum chemistry HOMO energy tasks and SUSY particle detection), and the accompanying convergence guarantees make preconditioned CG more reliable for large-scale KRR.
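The full-data pipeline summarized above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: it assumes the full-data KRR system has the form $(\boldsymbol{A} + \mu \mathbf{I}) \boldsymbol{\beta} = \boldsymbol{y}$, uses block size 1, and the helper names (`rpcholesky`, `make_preconditioner`, `pcg`), the Gaussian kernel, and the values of `N`, `r`, and `mu` are all hypothetical choices made for the demo.

```python
import numpy as np

def rpcholesky(A, r, rng):
    """Randomly pivoted partial Cholesky (block size 1): A ≈ F @ F.T."""
    N = A.shape[0]
    F = np.zeros((N, r))
    d = np.diag(A).astype(float)           # diagonal of the residual A - F F^T
    for i in range(r):
        s = rng.choice(N, p=d / d.sum())   # pivot drawn in proportion to the residual diagonal
        g = A[:, s] - F[:, :i] @ F[s, :i]  # column s of the residual
        F[:, i] = g / np.sqrt(g[s])
        d = np.clip(d - F[:, i] ** 2, 0.0, None)
    return F

def make_preconditioner(F, mu):
    """Return v -> P^{-1} v for P = F F^T + mu I, using a thin SVD of F."""
    U, s, _ = np.linalg.svd(F, full_matrices=False)
    w = 1.0 / (s ** 2 + mu) - 1.0 / mu
    return lambda v: v / mu + U @ (w * (U.T @ v))

def pcg(matvec, b, apply_Pinv, tol=1e-8, maxiter=500):
    """Textbook preconditioned conjugate gradient; returns (solution, iterations)."""
    x = np.zeros_like(b)
    res = b.copy()
    z = apply_Pinv(res)
    p = z.copy()
    rz = res @ z
    for it in range(1, maxiter + 1):
        Ap = matvec(p)
        alpha = rz / (p @ Ap)
        x += alpha * p
        res -= alpha * Ap
        if np.linalg.norm(res) <= tol * np.linalg.norm(b):
            return x, it
        z = apply_Pinv(res)
        rz_new = res @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x, maxiter

# Small synthetic demo (all sizes and parameters are illustrative assumptions).
rng = np.random.default_rng(0)
N, r, mu = 300, 100, 1e-3
X = rng.standard_normal((N, 2))
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
A = np.exp(-sq / 2.0)                      # Gaussian kernel matrix, bandwidth 1
y = rng.standard_normal(N)

F = rpcholesky(A, r, rng)
apply_Pinv = make_preconditioner(F, mu)
matvec = lambda v: A @ v + mu * v          # action of A + mu I
beta, iters = pcg(matvec, y, apply_Pinv)
_, iters_plain = pcg(matvec, y, lambda v: v)   # unpreconditioned CG for comparison
print("PCG iterations:", iters, "| plain CG iterations:", iters_plain)
```

Forming the dense kernel matrix here is only for convenience at small `N`; the pivoting loop itself touches just the diagonal and `r` columns of `A`, which is what makes the approach attractive at scale.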

Abstract

This paper investigates preconditioned conjugate gradient techniques for solving kernel ridge regression (KRR) problems with a medium to large number of data points ($10^4 \leq N \leq 10^7$), and it describes two methods with the strongest guarantees available. The first method, RPCholesky preconditioning, accurately solves the full-data KRR problem in $O(N^2)$ arithmetic operations, assuming sufficiently rapid polynomial decay of the kernel matrix eigenvalues. The second method, KRILL preconditioning, offers an accurate solution to a restricted version of the KRR problem involving $k \ll N$ selected data centers at a cost of $O((N + k^2) k \log k)$ operations. The proposed methods efficiently solve a range of KRR problems, making them well-suited for practical applications.
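The restricted problem and the KRILL preconditioner can be illustrated in the same spirit. The sketch below assumes the restricted KRR normal equations take the form $[\boldsymbol{A}(\mathsf{S},:)\boldsymbol{A}(:,\mathsf{S}) + \mu \boldsymbol{A}(\mathsf{S},\mathsf{S})] \boldsymbol{\beta} = \boldsymbol{A}(\mathsf{S},:) \boldsymbol{y}$ and that $\boldsymbol{B}$ is a sparse sign sketch of the $N \times k$ block $\boldsymbol{A}(:,\mathsf{S})$. The embedding dimension `4 * k`, the sparsity `zeta = 8`, the Laplace kernel, the uniformly random centers, and the helper names are hypothetical choices, and the nonzero positions are drawn with replacement for simplicity. Instead of running CG, the example inspects the spectrum that CG would see after preconditioning.

```python
import numpy as np

def sparse_sign_sketch(M, d, zeta, rng):
    """Return Phi @ M for a d x N sparse sign embedding Phi, without forming Phi.

    Each column of Phi receives zeta entries equal to +-1/sqrt(zeta)
    at random rows (rows drawn with replacement, a simplification).
    """
    N = M.shape[0]
    B = np.zeros((d, M.shape[1]))
    for _ in range(zeta):
        rows = rng.integers(0, d, size=N)
        signs = rng.choice([-1.0, 1.0], size=N)
        np.add.at(B, rows, signs[:, None] * M)   # unbuffered scatter-add of signed rows
    return B / np.sqrt(zeta)

def laplace_kernel(X, Y):
    """Illustrative kernel exp(-||x - y||)."""
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-np.sqrt(sq))

# Small synthetic demo (all sizes and parameters are illustrative assumptions).
rng = np.random.default_rng(0)
N, k, mu = 2000, 50, 1e-3
X = rng.standard_normal((N, 2))
S = rng.choice(N, size=k, replace=False)   # centers chosen uniformly at random
A_S = laplace_kernel(X, X[S])              # A(:, S), shape N x k
A_SS = A_S[S]                              # A(S, S), shape k x k
y = rng.standard_normal(N)
rhs = A_S.T @ y                            # right-hand side of the restricted system

M = A_S.T @ A_S + mu * A_SS                # restricted KRR system matrix
B = sparse_sign_sketch(A_S, d=4 * k, zeta=8, rng=rng)
P = B.T @ B + mu * A_SS                    # KRILL-style preconditioner

# Eigenvalues of L^{-1} M L^{-T} with P = L L^T govern CG convergence.
L = np.linalg.cholesky(P)
T = np.linalg.solve(L, M)                  # L^{-1} M
C = np.linalg.solve(L, T.T).T              # L^{-1} M L^{-T}
eig = np.linalg.eigvalsh((C + C.T) / 2)
cond_pre = eig[-1] / eig[0]
cond_plain = np.linalg.cond(M)
print("condition number before:", cond_plain, "after:", cond_pre)
```

Because the regularization term $\mu \boldsymbol{A}(\mathsf{S},\mathsf{S})$ appears unchanged in both $\boldsymbol{P}$ and the system matrix, only the data-fit term is approximated by the sketch, which is consistent with the claim that the method works for any $\mu$.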

Paper Structure (34 sections, 2 theorems, 52 equations, 11 figures, 1 table, 5 algorithms)

Motivation
RPCholesky.
KRILL.
Broader context.
Plan for paper
Algorithms and best practices
Full-data kernel ridge regression
RPCholesky preconditioning
Empirical performance
Theoretical guarantees
Restricted kernel ridge regression
KRILL preconditioning
Empirical performance
Theoretical guarantees
Background and comparisons with other preconditioners
...and 19 more sections

Key Result

Theorem 2.2

Fix a failure probability $\delta \in (0,1)$ and an error tolerance $\varepsilon \in (0,1)$. Let $\boldsymbol{A}$ be any positive-semidefinite matrix, and let $\mu > 0$ be a positive number. Construct a random approximation $\boldsymbol{\widehat{A}}$ using RPCholesky with block size $B = 1$ and appr... With probability at least $1 - \delta$, the RPCholesky preconditioner $\boldsymbol{P} = \boldsymbol{\widehat{A}} + \mu \mathbf{I}$ ...

Figures (11)

Figure 1: Fraction of solved problems versus number of CG iterations for the 20 KRR problem instances in Table 1.
Figure 2: Relative residual versus number of CG iterations for the problem with the fastest (COMET_MC_SAMPLE, left) and slowest (w8a, right) CG convergence when $\mu/N = 10^{-7}$. Note the different vertical axis scales.
Figure 3: Eigenvalue decay for the 20 KRR problems in Table 1. Left panel shows eigenvalues of the matrix $\boldsymbol{A} + \mu \mathbf{I}$ before RPCholesky preconditioning. Middle panel shows eigenvalues of the matrix $\mu \boldsymbol{P}^{-1/2} (\boldsymbol{A} + \mu \mathbf{I}) \boldsymbol{P}^{-1/2}$ after RPCholesky preconditioning with $r = 1000$. Right panel shows eigenvalues $\lambda_{i+r}(\boldsymbol{A} + \mu \mathbf{I})$ resulting from the mathematically optimal rank-$r$ preconditioner. Dashed lines indicate the regularization parameter $\mu = 10^{-7} N$.
Figure 4: Fraction of solved problems versus number of CG iterations for the 20 kernel problems in Table 1.
Figure 5: Relative residual versus number of CG iterations for the problems with the fastest (COMET_MC_SAMPLE, left) and slowest (creditcard, right) CG convergence when $\mu/N = 10^{-6}$. Thick lines show the median and shaded regions show the $20\%$--$80\%$ error quantiles over 100 random trials.
...and 6 more figures

Theorems & Definitions (3)

Definition 2.1: Tail rank
Theorem 2.2: RPCholesky preconditioning
Theorem 2.3: KRILL performance guarantee
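
The tail rank of Definition 2.1 is listed here only by name. Assuming the usual reading, in which the $\mu$-tail rank of a positive-semidefinite matrix $\boldsymbol{A}$ is the smallest $r \geq 0$ such that the eigenvalues beyond the $r$-th largest sum to at most $\mu$, it can be computed for a small matrix as follows. The function name `tail_rank` and the dense eigendecomposition are illustrative choices, and the definition itself should be checked against the paper.

```python
import numpy as np

def tail_rank(A, mu):
    """Smallest r >= 0 with sum_{i > r} lambda_i(A) <= mu (assumed definition)."""
    lam = np.sort(np.linalg.eigvalsh(A))[::-1]   # eigenvalues in decreasing order
    lam = np.clip(lam, 0.0, None)                # guard against tiny negative round-off
    # tails[r] is the sum of all eigenvalues after the r largest, for r = 0, ..., N
    tails = np.concatenate([np.cumsum(lam[::-1])[::-1], [0.0]])
    return int(np.argmax(tails <= mu))           # index of the first tail below mu

# Eigenvalues 4, 2, 1, 0.5: the tail sums are 7.5, 3.5, 1.5, 0.5, 0.
A_demo = np.diag([4.0, 2.0, 1.0, 0.5])
r_demo = tail_rank(A_demo, mu=1.6)
```

Under this reading, a larger regularization parameter $\mu$ or faster eigenvalue decay gives a smaller tail rank.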
