Table of Contents
Fetching ...

Large-Scale Gaussian Processes via Alternating Projection

Kaiwen Wu, Jonathan Wenger, Haydn Jones, Geoff Pleiss, Jacob R. Gardner

TL;DR

The paper tackles the cubic bottleneck in Gaussian process training and inference by introducing an alternating-projection solver that operates on subblocks of the kernel matrix, achieving $\, ext{O}( ) \, $ per-iteration complexity and linear convergence. By reframing linear solves in the RKHS and using a block-structured projection scheme with cached Cholesky factors, the method enables efficient mini-batching and GPU-friendly updates. Empirically, the approach yields substantial speedups over CG—up to $27\times$ for training and $72\times$ for inference—while scaling to datasets with up to $4$ million points without inducing points, and showing robustness to ill-conditioning. This leads to practical GP training and deployment on datasets far larger than before, preserving predictive performance relative to traditional CG-based methods. Overall, the work demonstrates that problem-specific iterative solvers exploiting RKHS structure can dramatically improve the scalability of dense GP methods on modern hardware.

Abstract

Training and inference in Gaussian processes (GPs) require solving linear systems with $n\times n$ kernel matrices. To address the prohibitive $\mathcal{O}(n^3)$ time complexity, recent work has employed fast iterative methods, like conjugate gradients (CG). However, as datasets increase in magnitude, the kernel matrices become increasingly ill-conditioned and still require $\mathcal{O}(n^2)$ space without partitioning. Thus, while CG increases the size of datasets GPs can be trained on, modern datasets reach scales beyond its applicability. In this work, we propose an iterative method which only accesses subblocks of the kernel matrix, effectively enabling mini-batching. Our algorithm, based on alternating projection, has $\mathcal{O}(n)$ per-iteration time and space complexity, solving many of the practical challenges of scaling GPs to very large datasets. Theoretically, we prove the method enjoys linear convergence. Empirically, we demonstrate its fast convergence in practice and robustness to ill-conditioning. On large-scale benchmark datasets with up to four million data points, our approach accelerates GP training and inference by speed-up factors up to $27\times$ and $72 \times$, respectively, compared to CG.

Large-Scale Gaussian Processes via Alternating Projection

TL;DR

The paper tackles the cubic bottleneck in Gaussian process training and inference by introducing an alternating-projection solver that operates on subblocks of the kernel matrix, achieving per-iteration complexity and linear convergence. By reframing linear solves in the RKHS and using a block-structured projection scheme with cached Cholesky factors, the method enables efficient mini-batching and GPU-friendly updates. Empirically, the approach yields substantial speedups over CG—up to for training and for inference—while scaling to datasets with up to million points without inducing points, and showing robustness to ill-conditioning. This leads to practical GP training and deployment on datasets far larger than before, preserving predictive performance relative to traditional CG-based methods. Overall, the work demonstrates that problem-specific iterative solvers exploiting RKHS structure can dramatically improve the scalability of dense GP methods on modern hardware.

Abstract

Training and inference in Gaussian processes (GPs) require solving linear systems with kernel matrices. To address the prohibitive time complexity, recent work has employed fast iterative methods, like conjugate gradients (CG). However, as datasets increase in magnitude, the kernel matrices become increasingly ill-conditioned and still require space without partitioning. Thus, while CG increases the size of datasets GPs can be trained on, modern datasets reach scales beyond its applicability. In this work, we propose an iterative method which only accesses subblocks of the kernel matrix, effectively enabling mini-batching. Our algorithm, based on alternating projection, has per-iteration time and space complexity, solving many of the practical challenges of scaling GPs to very large datasets. Theoretically, we prove the method enjoys linear convergence. Empirically, we demonstrate its fast convergence in practice and robustness to ill-conditioning. On large-scale benchmark datasets with up to four million data points, our approach accelerates GP training and inference by speed-up factors up to and , respectively, compared to CG.
Paper Structure (49 sections, 4 theorems, 35 equations, 7 figures, 6 tables, 2 algorithms)

This paper contains 49 sections, 4 theorems, 35 equations, 7 figures, 6 tables, 2 algorithms.

Key Result

Theorem 1

Let $\mathbf{W}^*$ be the unique solution of the linear system $\mathbf{K} \mathbf{W} = \mathbf{B}$ and $\mathbf{W}^{(t)}$ its approximation after $t$ epochs of alg:alternating_projection using the modified GS rule eq:approximate_gauss_southwell. Then it holds that where $\lVert\mathbf{W} - \mathbf{W}^*\rVert_{\mathbf{K}}^2 = \mathop{\mathrm{tr}}\nolimits([)]{(\mathbf{W} - \mathbf{W}^*)^\top \mat

Figures (7)

  • Figure 1: Convergence of alternating projection and (preconditioned) conjugate gradient. The $x$-axis is the number iterations for CG and the number epochs for alternating projection. Both methods are initialized at zero, but CG increases the residual after the first iteration. Left: While the asymptotic convergence rate of CG can be faster than alternating projection, CG does not find a better solution than alternating projection in the first $1000$ iterations. Right: CG struggles with convergence due to ill-conditioning and does not reach the tolerance $\epsilon = 1$. In contrast, alternating projection convergences. See §\ref{['sec:convergence']} for more details.
  • Figure 2: Left: Illustration of alternating projection. $s^{(1)}$ is the projection of $g = r^{(0)}$ onto the subspace spanned by $k(\mathbf{x}_1, \cdot)$ and $k(\mathbf{x}_2, \cdot)$. The residual $r^{(1)} = g - s^{(1)}$ will be projected to other coordinates in the next iteration. Right: Gauss-Southwell block selection rule results in faster convergence than random and cyclic selection rules.
  • Figure 3: Convergence of alternating projection with different batch sizes $b$ on 3droad. Left: Smaller batch sizes converge faster within the same epochs. Right: However, smaller batch sizes result in more sequential updates on the GPU and thus longer wall-clock time.
  • Figure 4: GP training with Adam on air quality. Left: As the likelihood noise $\sigma^2$ decreases in training, the kernel matrix $\mathbf{K}$ gets more ill-conditioned. Right: The $y$-axis is the number of iterations for CG and the number of epochs for alternating projection. CG is sensitive to this increased ill-conditioning, while alternating projections is robust.
  • Figure 5: Running CG and alternating projection on test-time solves $\mathbf{K}^{-1} (\mathbf{y} - \boldsymbol \mu )$. The $x$-axis is the number of iterations for CG and the number of epochs for alternating projection. Left: CG has a faster asymptotic convergence rate, but CG does not reach the test-time tolerance $\epsilon = 0.01$ much faster. Right: Alternating projection reaches the tolerance $\epsilon = 0.01$ faster despite its slower asymptotic convergence rate.
  • ...and 2 more figures

Theorems & Definitions (5)

  • Definition 1: Projection Operator
  • Theorem 1
  • Lemma 1
  • Lemma 2
  • Theorem 1