Scalable Gaussian Processes with Low-Rank Deep Kernel Decomposition

Yunqin Zhu, Henry Shaowu Yuchi, Yao Xie

TL;DR

This paper addresses the need for expressive yet scalable Gaussian process kernels by introducing the Deep Basis Kernel (DBK), a fully data-driven, low-rank kernel representation built from neural basis functions via Mercer's theorem. By construction, DBK supports exact GP inference in linear time without inducing points and enables scalable weight-space variational training for large datasets, complemented by a variance-correction procedure to guard against overconfident uncertainty. The authors demonstrate that DBK achieves improved predictive accuracy and better uncertainty calibration compared with full GP, sparse GP, and deep kernel learning variants across synthetic and real-world regression tasks, while delivering strong computational efficiency. The work provides a cohesive framework that unifies exact and variational inference for scalable, data-driven kernels, with practical impact on large-scale GP applications.

Abstract

Kernels are key to encoding prior beliefs and data structures in Gaussian process (GP) models. The design of expressive and scalable kernels has garnered significant research attention. Deep kernel learning enhances kernel flexibility by feeding inputs through a neural network before applying a standard parametric form. However, this approach remains limited by the choice of base kernels, inherits high inference costs, and often demands sparse approximations. Drawing on Mercer's theorem, we introduce a fully data-driven, scalable deep kernel representation where a neural network directly represents a low-rank kernel through a small set of basis functions. This construction enables highly efficient exact GP inference in linear time and memory without invoking inducing points. It also supports scalable mini-batch training based on a principled variational inference framework. We further propose a simple variance correction procedure to guard against overconfidence in uncertainty estimates. Experiments on synthetic and real-world data demonstrate the advantages of our deep kernel GP in terms of predictive accuracy, uncertainty quantification, and computational efficiency.
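The linear-time claim can be illustrated with a minimal sketch, assuming the kernel is an inner product of $r$ basis functions, $k({\bm{x}}_1,{\bm{x}}_2)=\sum_{i=1}^r \phi_i({\bm{x}}_1)\phi_i({\bm{x}}_2)$, so that the GP is equivalent to Bayesian linear regression on the features with a standard normal prior on the weights. The one-layer `features` map, the helper names `fit_weight_space` and `predict`, and the toy data are hypothetical stand-ins, not the authors' implementation. The sketch only checks that the $r \times r$ weight-space solve reproduces the usual $n \times n$ function-space posterior.

```python
import numpy as np

rng = np.random.default_rng(0)

def features(x, W, b):
    # Hypothetical stand-in for the learned basis functions: one tanh layer.
    return np.tanh(x @ W + b)                      # shape (n, r)

def fit_weight_space(Phi, y, noise_var):
    # Posterior over weights w ~ N(0, I) given y = Phi w + Gaussian noise.
    r = Phi.shape[1]
    A = Phi.T @ Phi + noise_var * np.eye(r)        # r x r, built in O(n r^2)
    L = np.linalg.cholesky(A)                      # A = L L^T
    mean_w = np.linalg.solve(L.T, np.linalg.solve(L, Phi.T @ y))
    return mean_w, L

def predict(Phi_star, mean_w, L, noise_var):
    # Posterior mean and variance of the latent function at test inputs.
    mean = Phi_star @ mean_w
    V = np.linalg.solve(L, Phi_star.T)             # shape (r, m)
    var = noise_var * np.sum(V ** 2, axis=0)       # noise_var * phi^T A^{-1} phi
    return mean, var

# Toy 1-D regression problem (assumed sizes, for illustration only).
n, m, d, r, noise_var = 200, 5, 1, 8, 0.1
W, b = rng.normal(size=(d, r)), rng.normal(size=r)
X = rng.uniform(-3, 3, size=(n, d))
y = np.sin(X[:, 0]) + np.sqrt(noise_var) * rng.normal(size=n)
X_star = np.linspace(-3, 3, m)[:, None]

Phi, Phi_star = features(X, W, b), features(X_star, W, b)
mean_w, L = fit_weight_space(Phi, y, noise_var)
mean_lr, var_lr = predict(Phi_star, mean_w, L, noise_var)

# Reference: textbook function-space GP formulas with the n x n Gram matrix.
K_noisy = Phi @ Phi.T + noise_var * np.eye(n)
K_star = Phi_star @ Phi.T
mean_full = K_star @ np.linalg.solve(K_noisy, y)
var_full = np.sum(Phi_star ** 2, axis=1) - np.sum(
    K_star * np.linalg.solve(K_noisy, K_star.T).T, axis=1
)
```

Under these assumptions the weight-space route never forms an $n \times n$ matrix: the cost is $O(nr^2)$ time and $O(nr)$ memory, which is linear in $n$ for fixed $r$. The two routes agree by the Woodbury identity.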

Paper Structure

This paper contains 35 sections, 1 theorem, 38 equations, 4 figures, 1 table.

Key Result

Theorem 1

Suppose that we observe the training data $({\bm{X}},{\bm{y}})$ from a fixed ground-truth function $f_{\rm gt}: {\mathcal{X}}\to\mathbb{R}$ and the noise variance $\sigma_{\epsilon}^2$ is known. Denote the true function values as ${\bm{f}}_{\rm gt} = [f_{\rm gt}({\bm{x}}_1), \ldots, f_{\rm gt}({\bm{

Figures (4)

  • Figure 1: Example of fitting a 1-D function. We compare the posterior (left) and the kernel $k(x_1,x_2)$ (right) of GP with RBF kernel, DKL with RBF base kernel, and the proposed DBK with or without variance correction. Standard RBF produces calibrated uncertainty but an undesirable function space. DKL learns a rich kernel structure but is limited by the bias of RBF as large entries concentrate around the diagonal. DBK learns a fully data-driven kernel, but without variance correction the predictions are overconfident. Variance correction fixes this issue and leads to a regularized posterior covering the true function.
  • Figure 2: The architecture of the proposed DBK. A shared NN ${\bm{\phi}}$ first maps two inputs ${\bm{x}}_1$ and ${\bm{x}}_2$ to two sets of features $\{\phi_i({\bm{x}}_1)\}_{i=1}^r$ and $\{\phi_i({\bm{x}}_2)\}_{i=1}^r$, respectively. Then, the kernel value $k({\bm{x}}_1,{\bm{x}}_2)$ is calculated by summing up the products between two sets of learned features.
  • Figure 3: Performance and wall-clock time for exact GP models on the synthetic 1-D dataset vs. the sample size $n$. The shaded area indicates $\pm 1$ standard deviation over 10 random seeds. Full GP and DKL fail at $n>10^4$ due to out-of-memory errors.
  • Figure 4: Posterior mean, variance, and kernel function learned by SVI models on the mobile internet quality dataset. We report the performance metrics in the captions.
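
The construction described in the Figure 2 caption can be written out as a short worked equation. This is a sketch based only on that caption: the feature matrix ${\bm{\Phi}}$, the Gram matrix ${\bm{K}}$, and the test vector ${\bm{c}}$ are notation introduced here for illustration and may differ from the paper's own symbols.

```latex
k({\bm{x}}_1,{\bm{x}}_2)
  = \sum_{i=1}^{r} \phi_i({\bm{x}}_1)\,\phi_i({\bm{x}}_2)
  = {\bm{\phi}}({\bm{x}}_1)^\top {\bm{\phi}}({\bm{x}}_2),
\qquad
{\bm{K}} = {\bm{\Phi}}{\bm{\Phi}}^\top,
\quad
{\bm{\Phi}} = \big[\phi_i({\bm{x}}_j)\big]_{j=1,\ldots,n;\; i=1,\ldots,r} \in \mathbb{R}^{n\times r},

{\bm{c}}^\top {\bm{K}}\, {\bm{c}} = \big\| {\bm{\Phi}}^\top {\bm{c}} \big\|_2^2 \ge 0
\quad \text{for all } {\bm{c}} \in \mathbb{R}^n,
\qquad
\operatorname{rank}({\bm{K}}) \le r .
```

Read this way, any choice of network weights yields a valid (symmetric, positive semidefinite) kernel, and the Gram matrix on $n$ points has rank at most $r$, which is the low-rank structure the title refers to.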

Theorems & Definitions (2)

  • Theorem 1
  • proof