
Kernel Multigrid: Accelerate Back-fitting via Sparse Gaussian Process Regression

Lu Zou, Liang Ding

TL;DR

The paper addresses the scalability challenge of training Additive Gaussian Processes with Bayesian Back-fitting by proving a fundamental convergence lower bound and introducing Kernel Multigrid (KMG). Leveraging Kernel Packets (KP) for efficient one-dimensional GP computations, it shows Back-fitting requires at least $O(n\log n)$ iterations, then demonstrates how KMG, combining Back-fitting with Sparse GPR on residuals, achieves $O(\log n)$ iterations while keeping per-iteration costs at $O(n\log n)$ time and $O(n)$ space. Theoretical guarantees hinge on a solid approximation property of sparse additive GPR and a smoothing analysis of the Back-fit operator coupled with coarse-grid corrections. Numerical experiments on synthetic and real data corroborate the theory, with KMG markedly accelerating convergence and accurately recovering per-dimension contributions using only a handful of inducing points. The work thus provides a practical pathway to scalable, interpretable, high-dimensional additive GP modeling.
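To see why a per-iteration contraction factor of the form $1-\mathcal{O}(1/n)$ translates into roughly $n\log n$ iterations, a back-of-the-envelope derivation may help. It rests on two assumptions that are not stated in this summary: the factor is exactly $1-c/n$ for some constant $0<c<n$, and "convergence" means driving the relative error below a tolerance $\epsilon=n^{-p}$ that shrinks polynomially in $n$.

```latex
\begin{align*}
\left(1-\frac{c}{n}\right)^{t}\le\epsilon
  &\iff t\ \ge\ \frac{\log(1/\epsilon)}{-\log\left(1-c/n\right)},\\
-\log(1-x)\le\frac{x}{1-x}\quad(0<x<1)
  &\implies t\ \ge\ \frac{n-c}{c}\,\log\frac{1}{\epsilon},\\
\epsilon=n^{-p}
  &\implies t\ \ge\ \frac{p\,(n-c)}{c}\,\log n .
\end{align*}
```

Under these assumptions the required iteration count grows like $n\log n$, matching the stated lower bound. A fixed tolerance $\epsilon$ would instead give growth linear in $n$.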

Abstract

Additive Gaussian Processes (GPs) are popular approaches for nonparametric feature selection. The common training method for these models is Bayesian Back-fitting. However, the convergence rate of Back-fitting in training additive GPs is still an open problem. By utilizing a technique called Kernel Packets (KP), we prove that the convergence rate of Back-fitting is no faster than $(1-\mathcal{O}(\frac{1}{n}))^t$, where $n$ and $t$ denote the data size and the iteration number, respectively. Consequently, Back-fitting requires a minimum of $\mathcal{O}(n\log n)$ iterations to achieve convergence. Based on KPs, we further propose an algorithm called Kernel Multigrid (KMG). This algorithm enhances Back-fitting by incorporating a sparse Gaussian Process Regression (GPR) to process the residuals after each Back-fitting iteration. It is applicable to additive GPs with both structured and scattered data. Theoretically, we prove that KMG reduces the required iterations to $\mathcal{O}(\log n)$ while preserving the time and space complexities at $\mathcal{O}(n\log n)$ and $\mathcal{O}(n)$ per iteration, respectively. Numerically, by employing a sparse GPR with merely 10 inducing points, KMG can produce accurate approximations of high-dimensional targets within 5 iterations.
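The idea of pairing each Back-fitting sweep with a cheap correction on a few inducing points can be illustrated with a small two-level sketch. This is not the authors' KMG implementation. It uses dense NumPy solves instead of Kernel Packets, so each sweep costs $\mathcal{O}(n^3)$ rather than $\mathcal{O}(n\log n)$, and the coarse step is a Galerkin correction on kernels centred at inducing points drawn from the data, used here as a stand-in for the sparse GPR step. The function names (`matern12`, `backfit_sweep`, `coarse_correct`, `run`), the test function, and all parameter values are hypothetical.

```python
import numpy as np

def matern12(x, z, ell=1.0):
    """Matern-1/2 (exponential) kernel matrix between two 1-D point sets."""
    return np.exp(-np.abs(x[:, None] - z[None, :]) / ell)

def backfit_sweep(U, Ks, Y, noise):
    """One Back-fitting sweep: refit each component to its partial residual."""
    n = Y.shape[0]
    for d, K in enumerate(Ks):
        partial = Y - U.sum(axis=0) + U[d]  # data minus all other components
        U[d] = K @ np.linalg.solve(K + noise * np.eye(n), partial)
    return U

def coarse_correct(U, Ks, idxs, Y, noise):
    """Galerkin correction in the span of kernels centred at inducing points.

    Assumption: inducing points are a subset of the data (indices idxs[d]),
    so no inverse of a full kernel matrix is ever needed.
    """
    D, m = len(Ks), len(idxs[0])
    P = [K[:, idx] for K, idx in zip(Ks, idxs)]  # n x m blocks, one per dimension
    resid = Y - U.sum(axis=0)
    A = np.zeros((D * m, D * m))
    b = np.zeros(D * m)
    for d in range(D):
        rows = slice(d * m, (d + 1) * m)
        b[rows] = P[d].T @ resid / noise - U[d][idxs[d]]
        for e in range(D):
            cols = slice(e * m, (e + 1) * m)
            A[rows, cols] = P[d].T @ P[e] / noise
        A[rows, rows] += Ks[d][np.ix_(idxs[d], idxs[d])]
    w = np.linalg.solve(A, b)
    for d in range(D):
        U[d] += P[d] @ w[d * m:(d + 1) * m]
    return U

def run(n=80, D=4, m=10, iters=10, noise=0.1, seed=0):
    """Compare plain Back-fitting with the two-level variant on toy data."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(size=(n, D))
    Y = 1.0 + np.sin(2 * np.pi * X).sum(axis=1)
    Y = Y + np.sqrt(noise) * rng.standard_normal(n)
    Ks = [matern12(X[:, d], X[:, d]) for d in range(D)]
    # m inducing points per dimension, evenly spaced in rank order
    pick = np.linspace(0, n - 1, m).astype(int)
    idxs = [np.argsort(X[:, d])[pick] for d in range(D)]
    # Exact per-dimension posterior means, used only to measure the error
    alpha = np.linalg.solve(sum(Ks) + noise * np.eye(n), Y)
    U_star = np.stack([K @ alpha for K in Ks])
    U_bf, U_mg = np.zeros((D, n)), np.zeros((D, n))
    for _ in range(iters):
        U_bf = backfit_sweep(U_bf, Ks, Y, noise)
        U_mg = backfit_sweep(U_mg, Ks, Y, noise)
        U_mg = coarse_correct(U_mg, Ks, idxs, Y, noise)
    err_bf = np.linalg.norm(U_bf - U_star)
    err_mg = np.linalg.norm(U_mg - U_star)
    return err_bf, err_mg, np.linalg.norm(U_star)

if __name__ == "__main__":
    print(run())
```

`run()` returns the component-wise error of plain Back-fitting, the error of the two-level variant, and the norm of the exact solution for scale. The coarse space is meant to capture the smooth error modes that Back-fitting removes slowly, which is the same division of labour as in a classical multigrid cycle.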

Paper Structure (28 sections, 14 theorems, 146 equations, 6 figures, 2 tables, 3 algorithms)

Key Result

Theorem 1

Let $\boldsymbol{X}$ be an LHD and $\boldsymbol{Y}$ be generated by an additive GP with kernel $k=\sum_{d=1}^Dk_d$, where each $k_d$ satisfies Assumption assump:kernel. Let $\boldsymbol{u}$ be the output of Algorithm alg:bayes_backfit with input $(\boldsymbol{X},\boldsymbol{Y})$ and iteration number $t$

Figures (6)

  • Figure 1: Left: the addition of five Matérn-${3}/{2}$ kernels $a_j k(\cdot,x_j)$ (colored lines, without compact supports) leads to a KP (black line, with a compact support); Right: converting 10 Matérn-${3}/{2}$ kernel functions $\{k(\cdot,x_i)\}_{i=1}^{10}$ to 10 KPs, where each KP is nonzero on at most three points in $\{x_i\}_{i=1}^{10}$.
  • Figure 2: $\sum_i\phi_i(x^*_j)$ can be normalized to $1$ for any $x^*_j$, as KPs induced by $\boldsymbol{X}^*=\{ih\}$ at different points have identical values.
  • Figure 3: Upper row: log of error decreases with number of iterations; lower row: error ratio $\|\boldsymbol{\varepsilon}^{(t)}\| /\|\boldsymbol{\varepsilon}^{(t-1)}\|$ is close to our lower bound.
  • Figure 4: Experiments with Matérn-${1}/{2}$. Upper row: logarithm of the error for the four competing algorithms. Middle row: the resulting prediction curves for KMG and Back-fitting compared to the target function and the underlying hidden function $\mathcal{G}_d$, when $\boldsymbol{X}_n$ is from an LHD. Lower row: the resulting prediction curves for KMG and Back-fitting compared to the target function and the underlying hidden function $\mathcal{G}_d$, when $\boldsymbol{X}_n$ is from a random design.
  • Figure 5: Experiments with Matérn-${3}/{2}$. Upper row: logarithm of the error for the four competing algorithms. Middle row: the resulting prediction curves for KMG and Back-fitting compared to the target function and the underlying hidden function $\mathcal{G}_d$, when $\boldsymbol{X}_n$ is from an LHD. Lower row: the resulting prediction curves for KMG and Back-fitting compared to the target function and the underlying hidden function $\mathcal{G}_d$, when $\boldsymbol{X}_n$ is from a random design.
  • ...and 1 more figure

Theorems & Definitions (19)

  • Theorem 1
  • Proposition 2: Proposition 1 of ding2022sample
  • Proposition 3
  • Proposition 4
  • Proposition 5
  • Theorem 6
  • Theorem 7: Approximation Property
  • Remark 8
  • Remark 9
  • Lemma 10: Smoothing Property
  • ...and 9 more