Table of Contents
Fetching ...

Contraction rates for conjugate gradient and Lanczos approximate posteriors in Gaussian process regression

Bernhard Stankewitz, Botond Szabo

TL;DR

This work develops and analyzes computation-aware Gaussian process posteriors that arise from probabilistic-numerics updates using empirical eigenvectors, Lanczos approximations, or conjugate gradient directions. It establishes minimax contraction rates for the three schemes (EVGP,LGP,CGGP) by coupling classical spectral concentration with KL-based contraction arguments, under detailed eigenvalue decay and moment assumptions. The key theoretical advance is linking fully numerical Lanczos/CG posteriors to variational Bayes formulations via Krylov spaces, and proving that, with enough iterations, these approximations recover the same inference quality as the full GP posterior at substantially lower computational cost. Numerical experiments with Matérn and squared exponential kernels corroborate the theory, showing substantial speedups (O(m n^2) vs O(n^3)) while preserving accurate posterior means and credible sets, thus offering scalable GP inference without sacrificing statistical guarantees.

Abstract

Due to their flexibility and theoretical tractability Gaussian process (GP) regression models have become a central topic in modern statistics and machine learning. While the true posterior in these models is given explicitly, numerical evaluations depend on the inversion of the augmented kernel matrix $ K + σ^2 I $, which requires up to $ O(n^3) $ operations. For large sample sizes n, which are typically given in modern applications, this is computationally infeasible and necessitates the use of an approximate version of the posterior. Although such methods are widely used in practice, they typically have very limtied theoretical underpinning. In this context, we analyze a class of recently proposed approximation algorithms from the field of Probabilistic numerics. They can be interpreted in terms of Lanczos approximate eigenvectors of the kernel matrix or a conjugate gradient approximation of the posterior mean, which are particularly advantageous in truly large scale applications, as they are fundamentally only based on matrix vector multiplications amenable to the GPU acceleration of modern software frameworks. We combine result from the numerical analysis literature with state of the art concentration results for spectra of kernel matrices to obtain minimax contraction rates. Our theoretical findings are illustrated by numerical experiments.

Contraction rates for conjugate gradient and Lanczos approximate posteriors in Gaussian process regression

TL;DR

This work develops and analyzes computation-aware Gaussian process posteriors that arise from probabilistic-numerics updates using empirical eigenvectors, Lanczos approximations, or conjugate gradient directions. It establishes minimax contraction rates for the three schemes (EVGP,LGP,CGGP) by coupling classical spectral concentration with KL-based contraction arguments, under detailed eigenvalue decay and moment assumptions. The key theoretical advance is linking fully numerical Lanczos/CG posteriors to variational Bayes formulations via Krylov spaces, and proving that, with enough iterations, these approximations recover the same inference quality as the full GP posterior at substantially lower computational cost. Numerical experiments with Matérn and squared exponential kernels corroborate the theory, showing substantial speedups (O(m n^2) vs O(n^3)) while preserving accurate posterior means and credible sets, thus offering scalable GP inference without sacrificing statistical guarantees.

Abstract

Due to their flexibility and theoretical tractability Gaussian process (GP) regression models have become a central topic in modern statistics and machine learning. While the true posterior in these models is given explicitly, numerical evaluations depend on the inversion of the augmented kernel matrix , which requires up to operations. For large sample sizes n, which are typically given in modern applications, this is computationally infeasible and necessitates the use of an approximate version of the posterior. Although such methods are widely used in practice, they typically have very limtied theoretical underpinning. In this context, we analyze a class of recently proposed approximation algorithms from the field of Probabilistic numerics. They can be interpreted in terms of Lanczos approximate eigenvectors of the kernel matrix or a conjugate gradient approximation of the posterior mean, which are particularly advantageous in truly large scale applications, as they are fundamentally only based on matrix vector multiplications amenable to the GPU acceleration of modern software frameworks. We combine result from the numerical analysis literature with state of the art concentration results for spectra of kernel matrices to obtain minimax contraction rates. Our theoretical findings are illustrated by numerical experiments.
Paper Structure (22 sections, 23 theorems, 196 equations, 8 figures, 2 tables, 2 algorithms)

This paper contains 22 sections, 23 theorems, 196 equations, 8 figures, 2 tables, 2 algorithms.

Key Result

Lemma 2.2

Given actions $s_{j}: = \widehat{u}_{j}$, $j \le m$, in Algorithm alg_GPApproximation, the approximate precision matrix $C_{m} = C_{m}^{ \normalfont \text{EV} }$ of $K_{ \sigma }^{-1}$ is given by

Figures (8)

  • Figure 1: Equivalence of LGP and CGGP for $n = 10$ observations from model \ref{['eq_N_MaternModel']}. LGP and CGGP posteriors $\Psi_{m}$ with $m = 5$.
  • Figure 2: Simulation results for the Matern kernel with $n = 3000$ observations from model \ref{['eq_N_MaternModel']}. EVGP and CGGP posteriors $\Psi_{m}$ with $m = 40$.
  • Figure 3: Simulation results for the Matern kernel with $n = 3000$ observations from model \ref{['eq_N_MaternModel']}. CGGP posteriors $\Psi^{ \text{CG} }_{m}$ with $m = 80$ and $20$, respectively.
  • Figure 4: Simulation results for the Matern kernel with $n = 3000$ observations from model \ref{['eq_N_MaternModel']}. CGGP posteriors $\Psi^{ \text{CG} }_{m}$ with $m = 160$. Log-log plot of computation times for the true posterior compared with the CGGP posterior for the optimal choice of $m$. $n \in \{ 5000, 6000, \dots, 15000 \}$ and intercept shifted to $0$.
  • Figure 5: Equivalence of LGP and CGGP for $n = 10$ observations in the regression model with $f_0$ given in \ref{['eq_N_TrueFunctionSqrExpKernel']}. The LGP and CGGP posteriors $\Psi_{m}$ are computed with $m = 3$ iterations. The Lanczos algorithm is initialized at $w_{0} = Y / \| Y \|$.
  • ...and 3 more figures

Theorems & Definitions (49)

  • Remark 2.1: Representer weights
  • Lemma 2.2: EVGP
  • Lemma 2.3: LGP
  • Lemma 2.4: CGGP
  • Proposition 3.1: Standard contraction rate, vZantenvdVaart2008RatesForGPPriors
  • Lemma 4.1: Empirical eigenvector actions and variational Bayes
  • Theorem 4.2: Contraction rates for LGP and CGGP
  • Remark 4.3: Relation to variational Bayes
  • Remark 4.4: Inconsistency example
  • Corollary 5.1: Polynomially decaying eigenvalues
  • ...and 39 more