On the Nystrom Approximation for Preconditioning in Kernel Machines

Amirhesam Abedsoltan; Parthe Pandit; Luis Rademacher; Mikhail Belkin

TL;DR

This work tackles the conditioning challenge in scalable kernel methods by analyzing Nyström-based approximations to spectral preconditioners for gradient-based optimization. It establishes a rigorous bound showing that a Nyström sample size of $s = \Omega\left(\frac{\log^4 n}{\varepsilon^4}\right)$ suffices to ensure $\kappa(\mathcal{P}_{s,q}^{1/2}\mathcal{K}\mathcal{P}_{s,q}^{1/2}) \le (1+\varepsilon)^4 \kappa(\mathcal{P}_q\mathcal{K})$ with high probability, i.e., near-exact speed-ups with substantially reduced storage and setup costs. The results provide concrete guidance for designing scalable preconditioned kernel methods (e.g., EigenPro variants) on large datasets by balancing sample size, accuracy, and computational overhead. Practically, this work supports efficient training of kernel machines on big data through Nyström-based spectral preconditioning and contributes to the theoretical foundations of scalable kernel optimization.

Abstract

Kernel methods are a popular class of nonlinear predictive models in machine learning. Scalable algorithms for learning kernel models need to be iterative in nature, but convergence can be slow due to poor conditioning. Spectral preconditioning is an important tool to speed up the convergence of such iterative algorithms for training kernel models. However, computing and storing a spectral preconditioner can be expensive, which can lead to large computational and storage overheads, precluding the application of kernel methods to problems with large datasets. A Nystrom approximation of the spectral preconditioner is often cheaper to compute and store, and has demonstrated success in practical applications. In this paper we analyze the trade-offs of using such an approximated preconditioner. Specifically, we show that a sample of logarithmic size (as a function of the size of the dataset) enables the Nystrom-based approximated preconditioner to accelerate gradient descent nearly as well as the exact preconditioner, while also reducing the computational and storage overheads.
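To make the comparison in the abstract concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes an EigenPro-style preconditioner of the form $\mathcal{P}_q = I - \sum_{i \le q} (1 - \lambda_{q+1}/\lambda_i)\, \psi_i \otimes \psi_i$, assumes that $\mathcal{K}$ is the empirical operator $f \mapsto \frac{1}{n}\sum_i f(x_i) K(\cdot, x_i)$, and restricts everything to functions of the form $\sum_i \alpha_i K(\cdot, x_i)$. The Laplace kernel, the sizes `n`, `s`, `q`, and all function names are illustrative choices.

```python
import numpy as np

def laplace_kernel(A, B, bandwidth=2.0):
    """Laplace kernel matrix with entries exp(-||a - b|| / bandwidth)."""
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.sqrt(np.maximum(sq, 0.0)) / bandwidth)

rng = np.random.default_rng(0)
n, s, q, d = 200, 50, 5, 3           # dataset size, Nystrom sample size, level q, dimension
X = rng.standard_normal((n, d))
G = laplace_kernel(X, X)

# A function f = sum_i alpha_i K(., x_i) is stored as its coefficient vector alpha.
# ASSUMPTION: the operator K maps alpha to (G / n) @ alpha in these coordinates.
K_mat = G / n

def top_eigs(M, m, k):
    """Top-k eigenvalues (descending) and unit eigenvectors of the symmetric matrix M / m."""
    vals, vecs = np.linalg.eigh(M / m)
    return vals[::-1][:k], vecs[:, ::-1][:, :k]

def preconditioner(idx, q):
    """Coefficient-space matrix of I - sum_{i<=q} (1 - lam_{q+1}/lam_i) psi_i (x) psi_i,
    where the psi_i are eigenfunctions built from the points X[idx] only.
    ASSUMPTION: EigenPro-style form; idx must contain distinct indices."""
    m = len(idx)
    lam, E = top_eigs(G[np.ix_(idx, idx)], m, q + 1)
    C = E[:, :q] / np.sqrt(m * lam[:q])   # coefficients of psi_i with unit RKHS norm
    w = 1.0 - lam[q] / lam[:q]            # damping weight for each of the top-q directions
    P = np.eye(n)
    # <psi_i, f> = sum_j C[j, i] f(x_j), and f(X[idx]) = G[idx, :] @ alpha
    P[idx, :] -= (C * w) @ (C.T @ G[idx, :])
    return P

def top_eigenvalue(M):
    """Largest real part among the eigenvalues of a (possibly non-symmetric) matrix."""
    return np.linalg.eigvals(M).real.max()

lam_full, _ = top_eigs(G, n, q + 1)
P_exact = preconditioner(np.arange(n), q)                       # uses all n points
P_nys = preconditioner(rng.choice(n, size=s, replace=False), q) # uses s sampled points

top_gd = lam_full[0]                           # top eigenvalue without preconditioning
top_exact = top_eigenvalue(P_exact @ K_mat)    # equals lam_{q+1} up to round-off
top_nys = top_eigenvalue(P_nys @ K_mat)        # Nystrom-approximated preconditioner
print(f"GD: {top_gd:.4g}  exact: {top_exact:.4g}  Nystrom: {top_nys:.4g}")
```

The largest eigenvalue limits the admissible step size of gradient descent, so the ratio `top_gd / top_exact` is a rough proxy for the speed-up of the exact preconditioner. Under the assumptions above, `top_nys` can never exceed `top_gd`, and how close it comes to `top_exact` depends on `s`. The Nyström variant only needs the eigendecomposition of an `s`-by-`s` matrix, which is where the storage and setup savings come from.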

Paper Structure (18 sections, 10 theorems, 46 equations, 2 tables)

INTRODUCTION
Spectral preconditioning
Nyström approximation
Main contribution
Organization
PRELIMINARIES
Square-root operator
Eigenvalue thresholding
PROBLEM SETUP
Gradient Descent and preconditioning
Preconditioned gradient descent
Approximated preconditioner
Speed-up of PGD over GD
MAIN RESULT
Speed-up of nPGD over GD
...and 3 more sections

Key Result

Proposition 1

Let $\bm{e}=(e_{j})\in\mathbb{R}^n$ be an eigenvector of the matrix $\left(K(x_j,x_k)\right)_{1\leq j,k\leq n}$ with eigenvalue $n\lambda$. Then $\psi = \sum_{j=1}^n e_j K(\cdot, x_j)$ is an eigenfunction of $\mathcal{K}$ with eigenvalue $\lambda$.
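The proposition can be checked numerically. The sketch below assumes that $\mathcal{K}$ is the empirical operator $(\mathcal{K}f)(z) = \frac{1}{n}\sum_{i=1}^n f(x_i) K(z, x_i)$, which this summary does not state explicitly. The Gaussian kernel, the data, and the helper names are illustrative.

```python
import numpy as np

def gaussian_kernel(A, B, bandwidth=1.0):
    """Gaussian kernel matrix with entries exp(-||a - b||^2 / (2 * bandwidth^2))."""
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * bandwidth**2))

rng = np.random.default_rng(0)
n, d = 50, 3
X = rng.standard_normal((n, d))
G = gaussian_kernel(X, X)              # kernel matrix (K(x_j, x_k))

evals, evecs = np.linalg.eigh(G)       # eigenvalues in ascending order
e = evecs[:, -1]                       # eigenvector for the largest eigenvalue n * lambda
lam = evals[-1] / n                    # claimed eigenvalue of the operator

def psi(Z):
    """psi(z) = sum_j e_j K(z, x_j), evaluated at the rows of Z."""
    return gaussian_kernel(Z, X) @ e

def apply_K(f, Z):
    """ASSUMPTION: (K f)(z) = (1/n) sum_i f(x_i) K(z, x_i)."""
    return gaussian_kernel(Z, X) @ f(X) / n

Z = rng.standard_normal((20, d))       # fresh test points, not in the dataset
lhs = apply_K(psi, Z)                  # (K psi)(z)
rhs = lam * psi(Z)                     # lambda * psi(z)
max_err = np.max(np.abs(lhs - rhs))
print(f"max |K psi - lambda psi| on test points: {max_err:.2e}")
```

The identity holds at arbitrary points, not just at the training points, because $\psi(x_i) = n\lambda e_i$ and substituting this into the assumed definition of $\mathcal{K}$ gives $\lambda\psi$.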

Theorems & Definitions (20)

Remark 1
Proposition 1
proof
Theorem 2
Lemma 3
proof
Corollary 4 (of Rosasco et al., 2010)
Proposition 5
proof
Proposition 6
...and 10 more
