Convergence rates of non-stationary and deep Gaussian process regression

Conor Osborne, Aretha L. Teckentrup

TL;DR

The paper gives a rigorous convergence analysis of Bayesian Gaussian process (GP) regression with non-stationary and deep priors. It shows that, under suitable regularity conditions, the reproducing kernel Hilbert spaces (RKHSs) of warping and mixture kernels are norm-equivalent to Sobolev spaces. It derives posterior convergence rates for non-stationary GPs and, using truncated constructions, for deep GPs (DGPs) and wide GPs (WGPs); the rates are optimal in the noise-free regime, and bounds are also obtained for noisy data when the noise vanishes appropriately. The results rely on explicit RKHS characterizations, sample-path regularity, and careful handling of hyper-parameter estimation via empirical Bayes, and they provide theoretical guarantees for practical non-stationary priors in black-box settings. The findings support the use of non-stationary priors in emulation and surrogate modeling, and they clarify when convergence is guaranteed and at what rate, depending on the data design and the regularity of the function. The work also highlights two open challenges: fully characterizing convolution-based non-stationarity, and hyper-parameter estimation in deep structures.

Abstract

The focus of this work is the convergence of non-stationary and deep Gaussian process regression. More precisely, we follow a Bayesian approach to regression or interpolation, where the prior placed on the unknown function $f$ is a non-stationary or deep Gaussian process, and we derive convergence rates of the posterior mean to the true function $f$ in terms of the number of observed training points. In some cases, we also show convergence of the posterior variance to zero. The only assumption imposed on the function $f$ is that it is an element of a certain reproducing kernel Hilbert space, which in particular cases we show to be norm-equivalent to a Sobolev space. Our analysis includes the case of estimated hyper-parameters in the covariance kernels employed, both in an empirical Bayes setting and the particular hierarchical setting constructed through deep Gaussian processes. We consider the settings of noise-free or noisy observations on deterministic or random training points. We establish general assumptions sufficient for the convergence of deep Gaussian process regression, along with explicit examples demonstrating the fulfilment of these assumptions. Specifically, our examples require that the Hölder or Sobolev norms of the penultimate layer are bounded almost surely.
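
The kind of convergence described in the abstract can be illustrated numerically. The following is a minimal sketch, not the paper's experiment: it assumes a stationary Matérn 5/2 prior in a common parameterisation, equispaced noise-free training points on $[0,1]$, and a hypothetical test function $\sin(2\pi x)$. The names `matern52`, `gp_posterior`, `f_true` and the small `jitter` term are illustrative choices, not taken from the paper. The sketch uses the standard GP conditioning formulas to compute the posterior mean and variance, and records the maximum error of the posterior mean for an increasing number of training points.

```python
import numpy as np

def matern52(x, y, sigma2=1.0, lam=0.5):
    """Matern 5/2 kernel (assumed parameterisation: variance sigma2, length scale lam)."""
    s = np.sqrt(5.0) * np.abs(x[:, None] - y[None, :]) / lam
    return sigma2 * (1.0 + s + s**2 / 3.0) * np.exp(-s)

def gp_posterior(x_train, y_train, x_test, kernel, jitter=1e-10):
    """Posterior mean and pointwise variance of a zero-mean GP given noise-free data."""
    # The jitter is a numerical stabiliser only (an assumption of this sketch).
    K = kernel(x_train, x_train) + jitter * np.eye(x_train.size)
    K_star = kernel(x_test, x_train)
    mean = K_star @ np.linalg.solve(K, y_train)
    # diag(k(X*, X*) - K_star K^{-1} K_star^T), computed row by row
    reduction = np.sum(K_star * np.linalg.solve(K, K_star.T).T, axis=1)
    var = np.diag(kernel(x_test, x_test)) - reduction
    return mean, var

def f_true(x):
    # Hypothetical smooth test function, chosen for illustration only.
    return np.sin(2.0 * np.pi * x)

x_test = np.linspace(0.0, 1.0, 201)
errors, max_vars = {}, {}
for n in (5, 10, 20):
    x_train = np.linspace(0.0, 1.0, n)
    mean, var = gp_posterior(x_train, f_true(x_train), x_test, matern52)
    errors[n] = np.max(np.abs(mean - f_true(x_test)))  # sup-norm error on the test grid
    max_vars[n] = np.max(var)                          # largest posterior variance
    print(n, errors[n], max_vars[n])
```

Under these assumptions the printed error should shrink as the number of training points grows; how fast it shrinks, and under which conditions on the prior and the design, is what the paper quantifies.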

Paper Structure

This paper contains 64 sections, 51 theorems, 166 equations, 12 figures, and 1 table.

Key Result

Proposition 2.1

If $k_s$ is positive semi-definite, then $k_{\mathrm{warp}}^{w,k_s}$ is positive semi-definite. If $w$ is injective and $k_s$ is positive definite, then $k_{\mathrm{warp}}^{w,k_s}$ is positive definite.
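
A numerical check can make the proposition concrete. The sketch below assumes the usual definition of a warping kernel, $k_{\mathrm{warp}}^{w,k_s}(u,u') = k_s(w(u), w(u'))$, which is not restated in this summary, and it borrows the warping function $w(u)=(u+1.5)^4$ and the Matérn 5/2 kernel from the caption of Figure 3. The interval $[-1,1]$, the grid size and the function names are assumptions made for illustration. On that interval $w$ is strictly increasing, hence injective, so the Gram matrix is expected to be symmetric with non-negative eigenvalues up to rounding error.

```python
import numpy as np

def matern52(x, y, sigma2=1.0, lam=0.5):
    """Matern 5/2 kernel (assumed parameterisation: variance sigma2, length scale lam)."""
    s = np.sqrt(5.0) * np.abs(x[:, None] - y[None, :]) / lam
    return sigma2 * (1.0 + s + s**2 / 3.0) * np.exp(-s)

def warp(u):
    # Warping function from the Figure 3 caption; injective for u > -1.5.
    return (u + 1.5) ** 4

def warped_kernel(x, y, w=warp, k_s=matern52):
    """Assumed warping construction: evaluate the stationary kernel at warped inputs."""
    return k_s(w(x), w(y))

u = np.linspace(-1.0, 1.0, 25)          # illustrative grid (an assumption)
K = warped_kernel(u, u)
min_eig = np.linalg.eigvalsh(K).min()   # smallest eigenvalue of the Gram matrix
print("symmetric:", np.allclose(K, K.T), "min eigenvalue:", min_eig)
```

A finite Gram matrix can only show that the claim is plausible on one point set; the proposition itself holds for every finite set of inputs.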

Figures (12)

  • Figure 1: Sample paths from $\mathcal{GP}(0,k_{\text{Mat}(5/2)})$, with $\{\sigma^2, \lambda\}=\{1, 0.5\}$.
  • Figure 2: Sample paths from $\mathcal{GP}(0,k_{\text{Mat}(5/2)})$, with $\{\sigma^2, \lambda\}=\{1, 3\}$.
  • Figure 3: Sample paths from $\mathcal{GP}(0, k_{\mathrm{warp}}^{w,k_s})$, with stationary kernel $k_{\text{Mat}(5/2)}$ and $w(u)=(u+1.5)^4$.
  • Figure 4: Sample paths from $\mathcal{GP}(0, k_{\mathrm{mix}}^{\{\sigma_\ell,k_{\ell}\}_{\ell=1}^2})$, with stationary kernels $k_1=k_{\text{Mat}(\infty)}$ with $\lambda = 1$ and $k_2=k_{\text{Mat}(\infty)}$ with $\lambda = 0.1$, and $\sigma_1(u)=\mathbbm{1}_{\{u<0\}}$, $\sigma_2(u)=\mathbbm{1}_{\{u\geq0\}}$.
  • Figure 5: Samples from $\mathcal{GP}(0, k_{\mathrm{conv}}^{\lambda_{a},k_i})$ with stationary Gaussian kernel and $\lambda_{a}(u)=\mathbbm{1}_{\{u<0\}}+\mathbbm{1}_{\{u\geq0\}}/100$.
  • ...and 7 more figures
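
Sample paths like those described for Figure 4 can be drawn along the following lines. This is a sketch under stated assumptions rather than the authors' code: it takes the mixture kernel to be $\sum_{\ell} \sigma_\ell(u)\, k_\ell(u,u')\, \sigma_\ell(u')$, reads $k_{\text{Mat}(\infty)}$ as the Gaussian kernel $\exp(-|u-u'|^2/(2\lambda^2))$, and uses a grid on $[-1,1]$; the domain, the grid size and all function names are illustrative. Because Gaussian-kernel Gram matrices are severely ill-conditioned, the sketch samples through an eigendecomposition with eigenvalues clipped at zero instead of a Cholesky factorisation.

```python
import numpy as np

def gaussian_kernel(x, y, lam):
    """Gaussian (squared-exponential) kernel with length scale lam."""
    d = x[:, None] - y[None, :]
    return np.exp(-0.5 * (d / lam) ** 2)

def mixture_kernel(x, y, sigmas, kernels):
    """Assumed mixture construction: sum over l of sigma_l(x) k_l(x, y) sigma_l(y)."""
    K = np.zeros((x.size, y.size))
    for sigma, k in zip(sigmas, kernels):
        K += sigma(x)[:, None] * k(x, y) * sigma(y)[None, :]
    return K

# Indicator weights and length scales taken from the Figure 4 caption.
sigmas = [lambda u: (u < 0).astype(float), lambda u: (u >= 0).astype(float)]
kernels = [lambda x, y: gaussian_kernel(x, y, 1.0),
           lambda x, y: gaussian_kernel(x, y, 0.1)]

def sample_paths(K, n_paths, rng):
    """Draw zero-mean Gaussian samples with covariance K via an eigendecomposition."""
    evals, evecs = np.linalg.eigh(K)
    root = evecs * np.sqrt(np.clip(evals, 0.0, None))  # root @ root.T approximates K
    return (root @ rng.standard_normal((K.shape[0], n_paths))).T

u = np.linspace(-1.0, 1.0, 200)   # illustrative grid (an assumption)
K_mix = mixture_kernel(u, u, sigmas, kernels)
paths = sample_paths(K_mix, 3, np.random.default_rng(0))
```

With indicator weights the covariance between any point left of zero and any point right of zero vanishes, so each sample path consists of two independent pieces: a slowly varying one for $u<0$ and a rapidly varying one for $u\geq 0$.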

Theorems & Definitions (57)

  • Proposition 2.1
  • Proposition 2.2
  • Proposition 2.3
  • Definition 3.1
  • Proposition 3.2 (Teckentrup, 2019)
  • Proposition 3.3
  • Proposition 3.4
  • Remark 3.5
  • Proposition 3.6
  • Theorem 3.7: RKHS for warping kernel
  • ...and 47 more