Have ASkotch: A Neat Solution for Large-scale Kernel Ridge Regression

Pratik Rathore; Zachary Frangella; Jiaming Yang; Michał Dereziński; Madeleine Udell

TL;DR

This work addresses the scalability of kernel ridge regression (KRR) on very large datasets by introducing ASkotch, a scalable, accelerated iterative solver for full KRR. ASkotch combines sketch-and-project updates with Nyström low-rank approximations and Nesterov acceleration, and it provably converges linearly; under favorable kernel spectra, the convergence rate is nearly independent of the condition number. The convergence analysis uses ridge leverage scores and determinantal point processes to bound projection shrinkage, and it yields a near-optimal, log-linear runtime for kernels with modest effective dimension. Empirically, ASkotch outperforms state-of-the-art methods for both full KRR and inducing points KRR on a testbed of 23 tasks, including a large taxi dataset. The work opens the door to new large-scale applications of full KRR and suggests future directions in distributed, mixed-precision, and automated-parameter implementations.
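To make the sketch-and-project idea concrete, the following is a minimal sketch of a plain (unaccelerated) randomized block coordinate solver for the KRR system $(K + \lambda I) w = y$. It is not ASkotch: it uses uniform block sampling and an exact block solve, where the summary above describes a Nyström approximation and Nesterov acceleration. The function names `rbf_kernel` and `sap_krr`, the synthetic data, and all parameter values are assumptions chosen for illustration.

```python
import numpy as np

def rbf_kernel(X, Z, bandwidth=1.0):
    """Gaussian (RBF) kernel matrix between rows of X and rows of Z."""
    sq_dists = (
        np.sum(X**2, axis=1)[:, None]
        + np.sum(Z**2, axis=1)[None, :]
        - 2.0 * X @ Z.T
    )
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * bandwidth**2))

def sap_krr(K, y, lam, block_size, n_iters, seed=0):
    """Plain sketch-and-project (randomized block coordinate) solver for
    (K + lam * I) w = y with uniformly sampled coordinate blocks."""
    rng = np.random.default_rng(seed)
    n = K.shape[0]
    w = np.zeros(n)
    for _ in range(n_iters):
        # Sample a block of distinct coordinates uniformly at random.
        B = rng.choice(n, size=block_size, replace=False)
        # Residual of the linear system restricted to the rows in B.
        r_B = K[B, :] @ w + lam * w[B] - y[B]
        # Exact block solve (illustrative; a low-rank approximation
        # of this block would be cheaper for large blocks).
        K_BB = K[np.ix_(B, B)] + lam * np.eye(block_size)
        w[B] -= np.linalg.solve(K_BB, r_B)
    return w

# Small synthetic problem (hypothetical data, for illustration only).
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 3))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(200)
K = rbf_kernel(X, X)
lam = 1.0

w_sap = sap_krr(K, y, lam, block_size=50, n_iters=2000)
w_direct = np.linalg.solve(K + lam * np.eye(200), y)
rel_err = np.linalg.norm(w_sap - w_direct) / np.linalg.norm(w_direct)
```

Each iteration touches only `block_size` rows of `K`, so the full kernel matrix never needs to be factorized; `rel_err` compares the iterate against a direct solve, which is feasible only because this example is small.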

Abstract

Kernel ridge regression (KRR) is a fundamental computational tool, appearing in problems that range from computational chemistry to health analytics, with a particular interest due to its starring role in Gaussian process regression. However, full KRR solvers are challenging to scale to large datasets: both direct (e.g., Cholesky decomposition) and iterative methods (e.g., PCG) incur prohibitive computational and storage costs. The standard approach to scaling KRR to large datasets chooses a set of inducing points and solves an approximate version of the problem, inducing points KRR. However, the resulting solution tends to have worse predictive performance than the full KRR solution. In this work, we introduce a new solver, ASkotch, for full KRR that provides better solutions faster than state-of-the-art solvers for full and inducing points KRR. ASkotch is a scalable, accelerated, iterative method for full KRR that provably obtains linear convergence. Under appropriate conditions, we show that ASkotch obtains condition-number-free linear convergence. This convergence analysis rests on the theory of ridge leverage scores and determinantal point processes. ASkotch outperforms state-of-the-art KRR solvers on a testbed of 23 large-scale KRR regression and classification tasks derived from a wide range of application domains, demonstrating the superiority of full KRR over inducing points KRR. Our work opens up the possibility of as-yet-unimagined applications of full KRR across a number of disciplines.

Paper Structure (60 sections, 26 theorems, 123 equations, 16 figures, 5 tables, 5 algorithms)

Introduction
Contributions
Roadmap
Notation
Preliminaries
SAP for Full KRR
SAP with Nesterov Acceleration
ASAP: Approximating the SAP Projection Step
Nyström Approximation
Randomized Nyström Approximation
Automatic Computation of the Stepsize
Ridge Leverage Score Sampling
Algorithms
Coordinate sampling distributions
Default hyperparameters
...and 45 more sections

Key Result

Lemma 4

Given $A \in \mathbb S_{+}^n$, a positive integer $k$, and $\lambda>0$ such that the $\lambda$-effective dimension of $A$ satisfies $d^{\lambda}(A) \leq k$, there is an algorithm (BLESS) that with high probability returns 2-approximations for all $n$ $\lambda$-ridge leverage scores ...
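For intuition about the quantities in this lemma, here is a minimal sketch that computes ridge leverage scores and the effective dimension exactly. It assumes the standard definitions, $\ell_i^{\lambda}(A) = [A(A + \lambda I)^{-1}]_{ii}$ and $d^{\lambda}(A) = \operatorname{tr}(A(A + \lambda I)^{-1})$; the paper's Definitions 1 and 2 are not reproduced on this page, so treat that as an assumption. This is not BLESS: the exact computation costs $O(n^3)$, which is the cost an approximation algorithm is meant to avoid. The function names and the test matrix are hypothetical.

```python
import numpy as np

def ridge_leverage_scores(A, lam):
    """Exact lam-ridge leverage scores: diag(A (A + lam I)^{-1})."""
    n = A.shape[0]
    # Solve (A + lam I) Z = A instead of forming an explicit inverse.
    # A and (A + lam I)^{-1} commute, so Z equals A (A + lam I)^{-1}.
    Z = np.linalg.solve(A + lam * np.eye(n), A)
    return np.diag(Z).copy()

def effective_dimension(A, lam):
    """lam-effective dimension: the sum of all ridge leverage scores."""
    return float(np.sum(ridge_leverage_scores(A, lam)))

# Rank-5 positive semidefinite test matrix (hypothetical data).
rng = np.random.default_rng(0)
G = rng.standard_normal((100, 5))
A = G @ G.T
lam = 1e-2

scores = ridge_leverage_scores(A, lam)
d_eff = effective_dimension(A, lam)

# Cross-check against the spectral formula sum_i mu_i / (mu_i + lam).
eigs = np.linalg.eigvalsh(A)
d_eff_spectral = float(np.sum(eigs / (eigs + lam)))
```

Because `A` has rank 5, the effective dimension stays below 5 for any `lam > 0`, which illustrates why a condition such as $d^{\lambda}(A) \leq k$ can hold with $k$ much smaller than $n$.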

Figures (16)

Figure 1: Full KRR is advantageous over inducing points KRR, even for large problems. Our method ASkotch, run with its default hyperparameters, outperforms the state-of-the-art for both full and inducing points KRR on a subsample of the taxi dataset. Falkon is limited to $m = 2 \cdot 10^4$ inducing points due to memory constraints. State-of-the-art Nyström PCG methods [frangella2023randomized] (Gaussian Nyström [frangella2023randomized] and Randomly Pivoted Cholesky [diaz2023robust, epperly2024embrace]), each with a rank $r=50$ preconditioner, fail to complete a single iteration. EigenPro 2.0 and EigenPro 3.0 (not shown) diverge on their default hyperparameters. All methods have a 24-hour time limit and are run on a single 48 GB NVIDIA RTX A6000 GPU.
Figure 2: Performance comparison between ASkotch and competitors on 10 classification and 13 regression tasks. We designate a classification problem as "solved" when the method reaches within 0.001 of the highest classification accuracy found across all the optimizer + hyperparameter combinations. We designate a regression problem as "solved" when the method reaches within 1% of the lowest MAE (in a relative sense) found across all the optimizer + hyperparameter combinations. PCG and Falkon are run in double precision. EigenPro 2.0, EigenPro 3.0, and PCG do not solve any of the regression problems within the tolerance. ASkotch outperforms the competition on both classification and regression.
Figure 3: Comparison between ASkotch and competitors on computer vision tasks. ASkotch and the competing methods all reach similarly high classification accuracies, but ASkotch achieves this accuracy in less time than the competition. The classification accuracy for PCG and Falkon sometimes peaks and then goes towards 0; this is unsurprising since Krylov methods can diverge if they are run for too many iterations.
Figure 4: Comparison between ASkotch and competitors on particle physics tasks. ASkotch reaches a classification accuracy similar to that of the competition on both comet_mc and susy, while taking much less time to reach this level of accuracy. However, EigenPro 2.0 and Falkon outperform ASkotch on miniboone and higgs, respectively. On the other hand, EigenPro 2.0 and EigenPro 3.0 both diverge on comet_mc, which shows that these methods do not always work well with their default hyperparameters, while ASkotch consistently provides good results with its defaults. Finally, PCG does not even reach 0.4 classification accuracy on susy, and it does not complete a single iteration on higgs.
Figure 5: Comparison between ASkotch and competitors on ecological modeling and online advertising tasks. ASkotch achieves comparable or higher classification accuracy than EigenPro 3.0, PCG, and Falkon on both datasets, while requiring less time to do so. While EigenPro 2.0 is competitive with ASkotch on covtype_binary, both EigenPro 2.0 and 3.0 diverge on click_prediction.
...and 11 more figures
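
The "solved" criteria stated in the Figure 2 caption can be written as two small predicates. This is a sketch of one reading of the caption, not the authors' evaluation code; the function names and default tolerance arguments are hypothetical, with the values 0.001 and 1% taken from the caption.

```python
def is_solved_classification(accuracy, best_accuracy, tol=0.001):
    """Solved if accuracy is within `tol` (absolute) of the best
    accuracy found across all optimizer + hyperparameter combinations."""
    return accuracy >= best_accuracy - tol

def is_solved_regression(mae, best_mae, rel_tol=0.01):
    """Solved if MAE is within `rel_tol` (relative) of the lowest
    MAE found across all optimizer + hyperparameter combinations."""
    return mae <= (1.0 + rel_tol) * best_mae

# Illustrative values only; these are not results from the paper.
clf_ok = is_solved_classification(accuracy=0.9500, best_accuracy=0.9505)
clf_bad = is_solved_classification(accuracy=0.9490, best_accuracy=0.9505)
reg_ok = is_solved_regression(mae=1.005, best_mae=1.0)
reg_bad = is_solved_regression(mae=1.020, best_mae=1.0)
```

Note the asymmetry: the classification tolerance is absolute, while the regression tolerance scales with the best MAE.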

Theorems & Definitions (35)

Definition 1: Ridge leverage scores
Definition 2: Effective dimension and maximal degrees of freedom
Definition 3: RLS approximation
Lemma 4 [rudi2018fast, Theorem 1]
Definition 5: Determinantal point process
Lemma 6 [derezinski2020improved, Lemma 5]
Lemma 7 [derezinski2021determinantal, Theorem 10]
Lemma 8
Definition 9: $\textup{ARLS}_{c}^{\lambda}$-sampling
Lemma 10: Projection analysis for ARLS Sampling
...and 25 more
