Contraction rates for conjugate gradient and Lanczos approximate posteriors in Gaussian process regression

Bernhard Stankewitz; Botond Szabo

Contraction rates for conjugate gradient and Lanczos approximate posteriors in Gaussian process regression

Bernhard Stankewitz, Botond Szabo

TL;DR

This work develops and analyzes computation-aware Gaussian process posteriors that arise from probabilistic-numerics updates using empirical eigenvectors, Lanczos approximations, or conjugate gradient directions. It establishes minimax contraction rates for the three schemes (EVGP,LGP,CGGP) by coupling classical spectral concentration with KL-based contraction arguments, under detailed eigenvalue decay and moment assumptions. The key theoretical advance is linking fully numerical Lanczos/CG posteriors to variational Bayes formulations via Krylov spaces, and proving that, with enough iterations, these approximations recover the same inference quality as the full GP posterior at substantially lower computational cost. Numerical experiments with Matérn and squared exponential kernels corroborate the theory, showing substantial speedups (O(m n^2) vs O(n^3)) while preserving accurate posterior means and credible sets, thus offering scalable GP inference without sacrificing statistical guarantees.

Abstract

Due to their flexibility and theoretical tractability Gaussian process (GP) regression models have become a central topic in modern statistics and machine learning. While the true posterior in these models is given explicitly, numerical evaluations depend on the inversion of the augmented kernel matrix $ K + σ^2 I $, which requires up to $ O(n^3) $ operations. For large sample sizes n, which are typically given in modern applications, this is computationally infeasible and necessitates the use of an approximate version of the posterior. Although such methods are widely used in practice, they typically have very limtied theoretical underpinning. In this context, we analyze a class of recently proposed approximation algorithms from the field of Probabilistic numerics. They can be interpreted in terms of Lanczos approximate eigenvectors of the kernel matrix or a conjugate gradient approximation of the posterior mean, which are particularly advantageous in truly large scale applications, as they are fundamentally only based on matrix vector multiplications amenable to the GPU acceleration of modern software frameworks. We combine result from the numerical analysis literature with state of the art concentration results for spectra of kernel matrices to obtain minimax contraction rates. Our theoretical findings are illustrated by numerical experiments.

Contraction rates for conjugate gradient and Lanczos approximate posteriors in Gaussian process regression

TL;DR

Abstract

, which requires up to

operations. For large sample sizes n, which are typically given in modern applications, this is computationally infeasible and necessitates the use of an approximate version of the posterior. Although such methods are widely used in practice, they typically have very limtied theoretical underpinning. In this context, we analyze a class of recently proposed approximation algorithms from the field of Probabilistic numerics. They can be interpreted in terms of Lanczos approximate eigenvectors of the kernel matrix or a conjugate gradient approximation of the posterior mean, which are particularly advantageous in truly large scale applications, as they are fundamentally only based on matrix vector multiplications amenable to the GPU acceleration of modern software frameworks. We combine result from the numerical analysis literature with state of the art concentration results for spectra of kernel matrices to obtain minimax contraction rates. Our theoretical findings are illustrated by numerical experiments.

Paper Structure (22 sections, 23 theorems, 196 equations, 8 figures, 2 tables, 2 algorithms)

This paper contains 22 sections, 23 theorems, 196 equations, 8 figures, 2 tables, 2 algorithms.

Introduction
Approximate posteriors from probabilistic numerics
A general class of algorithms
Eigenvector, Lanczos and conjugate gradient posteriors
Empirical eigenvector posterior
Lanczos eigenvector posterior (LGP)
Conjugate gradient posterior (CGGP)
Approximate posterior contraction
Main results
Examples
Polynomially decaying eigenvalues
Exponentially decaying eigenvalues
Technical analysis
Approximate contraction via Kullback-Leibler bounds
Analysis of the Lanczos approximate posterior
...and 7 more sections

Key Result

Lemma 2.2

Given actions $s_{j}: = \widehat{u}_{j}$, $j \le m$, in Algorithm alg_GPApproximation, the approximate precision matrix $C_{m} = C_{m}^{ \normalfont \text{EV} }$ of $K_{ \sigma }^{-1}$ is given by

Figures (8)

Figure 1: Equivalence of LGP and CGGP for $n = 10$ observations from model \ref{['eq_N_MaternModel']}. LGP and CGGP posteriors $\Psi_{m}$ with $m = 5$.
Figure 2: Simulation results for the Matern kernel with $n = 3000$ observations from model \ref{['eq_N_MaternModel']}. EVGP and CGGP posteriors $\Psi_{m}$ with $m = 40$.
Figure 3: Simulation results for the Matern kernel with $n = 3000$ observations from model \ref{['eq_N_MaternModel']}. CGGP posteriors $\Psi^{ \text{CG} }_{m}$ with $m = 80$ and $20$, respectively.
Figure 4: Simulation results for the Matern kernel with $n = 3000$ observations from model \ref{['eq_N_MaternModel']}. CGGP posteriors $\Psi^{ \text{CG} }_{m}$ with $m = 160$. Log-log plot of computation times for the true posterior compared with the CGGP posterior for the optimal choice of $m$. $n \in \{ 5000, 6000, \dots, 15000 \}$ and intercept shifted to $0$.
Figure 5: Equivalence of LGP and CGGP for $n = 10$ observations in the regression model with $f_0$ given in \ref{['eq_N_TrueFunctionSqrExpKernel']}. The LGP and CGGP posteriors $\Psi_{m}$ are computed with $m = 3$ iterations. The Lanczos algorithm is initialized at $w_{0} = Y / \| Y \|$.
...and 3 more figures

Theorems & Definitions (49)

Remark 2.1: Representer weights
Lemma 2.2: EVGP
Lemma 2.3: LGP
Lemma 2.4: CGGP
Proposition 3.1: Standard contraction rate, vZantenvdVaart2008RatesForGPPriors
Lemma 4.1: Empirical eigenvector actions and variational Bayes
Theorem 4.2: Contraction rates for LGP and CGGP
Remark 4.3: Relation to variational Bayes
Remark 4.4: Inconsistency example
Corollary 5.1: Polynomially decaying eigenvalues
...and 39 more

Contraction rates for conjugate gradient and Lanczos approximate posteriors in Gaussian process regression

TL;DR

Abstract

Contraction rates for conjugate gradient and Lanczos approximate posteriors in Gaussian process regression

Authors

TL;DR

Abstract

Table of Contents

Key Result

Figures (8)

Theorems & Definitions (49)