Asymptotic properties of Vecchia approximation for Gaussian processes

Myeongjong Kang, Florian Schäfer, Joseph Guinness, Matthias Katzfuss

Abstract

Vecchia approximation has been widely used to accurately scale Gaussian-process (GP) inference to large datasets, by expressing the joint density as a product of conditional densities with small conditioning sets. We study fixed-domain asymptotic properties of Vecchia-based GP inference for a large class of covariance functions (including Matérn covariances) with boundary conditioning. In this setting, we establish that consistency and asymptotic normality of maximum exact-likelihood estimators imply those of maximum Vecchia-likelihood estimators, and that exact GP prediction can be approximated accurately by Vecchia GP prediction, given that the size of conditioning sets grows polylogarithmically with the data size. Hence, Vecchia-based inference with quasilinear complexity is asymptotically equivalent to exact GP inference with cubic complexity. This also provides a general new result on the screening effect. Our findings are illustrated by numerical experiments, which also show that Vecchia approximation can be more accurate than alternative approaches such as covariance tapering and reduced-rank approximations.
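The factorization described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a zero-mean GP with an exponential (Matérn, smoothness 0.5) covariance, a random ordering of the inputs rather than the maximin ordering used in the paper, and conditioning sets made of the `m` nearest previously ordered inputs. The names `exp_cov`, `gauss_logpdf`, and `vecchia_loglik` are hypothetical helpers introduced only for this example.

```python
import numpy as np

def exp_cov(a, b, variance=1.0, range_=0.1):
    """Exponential covariance (Matern with smoothness 0.5); an assumed kernel."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return variance * np.exp(-d / range_)

def gauss_logpdf(y, cov):
    """Log-density of a zero-mean Gaussian vector y with covariance cov."""
    chol = np.linalg.cholesky(cov)
    z = np.linalg.solve(chol, y)
    return -0.5 * z @ z - np.log(np.diag(chol)).sum() - 0.5 * len(y) * np.log(2 * np.pi)

def vecchia_loglik(y, x, m, **kern):
    """Sum over i of log p(y_i | y_c(i)), with c(i) the (at most) m nearest
    previously ordered inputs. Cost per term is O(m^3) instead of O(i^3)."""
    total = 0.0
    for i in range(len(y)):
        dist = np.linalg.norm(x[:i] - x[i], axis=1)
        c = np.argsort(dist)[:m]      # empty for i = 0 or m = 0
        idx = np.append(c, i)
        # log p(y_i | y_c) = log p(y_c, y_i) - log p(y_c)
        total += gauss_logpdf(y[idx], exp_cov(x[idx], x[idx], **kern))
        if c.size > 0:
            total -= gauss_logpdf(y[c], exp_cov(x[c], x[c], **kern))
    return total

rng = np.random.default_rng(0)
n = 60
x = rng.random((n, 2))                                   # inputs in the unit square
y = np.linalg.cholesky(exp_cov(x, x)) @ rng.standard_normal(n)

exact = gauss_logpdf(y, exp_cov(x, x))                   # cubic-cost exact log-likelihood
full = vecchia_loglik(y, x, m=n - 1)                     # conditioning on all previous points
approx = vecchia_loglik(y, x, m=10)                      # small conditioning sets
print(exact, full, approx)
```

With `m = n - 1` every conditioning set contains all previously ordered inputs, so the product of conditionals is the exact joint density and the two values agree up to rounding. Smaller `m` trades accuracy for cost.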

Paper Structure (9 sections, 6 theorems, 45 equations, 6 figures)

This paper contains 9 sections, 6 theorems, 45 equations, 6 figures.

Key Result

Proposition 1

Assume (A1)--(A4). For any $\bm{\theta} \in \bm{\Theta}$, the KL divergence between the Gaussian measures with exact likelihood $p_{n}$ and Vecchia likelihood $\hat{p}_{n,m}$ under (A4) decays faster than any fixed-order polynomial as a function of $n$.
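The KL divergence in this proposition can be computed directly for small examples. The sketch below is illustrative only and does not reproduce the paper's assumptions (A1)--(A4): it assumes a zero-mean GP with an exponential covariance, a random ordering, and nearest-neighbor conditioning sets. It relies on the fact that each Vecchia factor is a true conditional density of the exact GP, so the KL divergence reduces to $\frac{1}{2}\sum_i \log(\hat{\sigma}_i^2 / \sigma_i^2)$, where $\hat{\sigma}_i^2$ and $\sigma_i^2$ are the conditional variances of $y_i$ given the conditioning set and given all previously ordered observations, respectively. The function names are hypothetical.

```python
import numpy as np

def exp_cov(a, b, variance=1.0, range_=0.1):
    """Exponential covariance (Matern with smoothness 0.5); an assumed kernel."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return variance * np.exp(-d / range_)

def cond_var(x, i, cond_idx, **kern):
    """Var(y_i | y_cond_idx) for a zero-mean GP observed at inputs x."""
    k_ii = exp_cov(x[[i]], x[[i]], **kern)[0, 0]
    if len(cond_idx) == 0:
        return k_ii
    k_cc = exp_cov(x[cond_idx], x[cond_idx], **kern)
    k_ci = exp_cov(x[cond_idx], x[[i]], **kern)[:, 0]
    return k_ii - k_ci @ np.linalg.solve(k_cc, k_ci)

def vecchia_kl(x, m, **kern):
    """KL(exact || Vecchia) = 0.5 * sum_i log(var_vecchia_i / var_exact_i)."""
    kl = 0.0
    for i in range(len(x)):
        dist = np.linalg.norm(x[:i] - x[i], axis=1)
        nearest = np.argsort(dist)[:m]      # m nearest previously ordered inputs
        everything = np.arange(i)           # all previously ordered inputs
        kl += 0.5 * np.log(cond_var(x, i, nearest, **kern)
                           / cond_var(x, i, everything, **kern))
    return kl

rng = np.random.default_rng(1)
x = rng.random((100, 2))
sizes = [1, 2, 5, 10, 20]
kls = [vecchia_kl(x, m) for m in sizes]
for m, kl in zip(sizes, kls):
    print(f"m = {m:2d}  KL = {kl:.3e}")
```

Because the nearest-neighbor sets are nested as `m` grows, each conditional variance can only shrink, so the computed KL divergence is non-increasing in `m` and vanishes once every conditioning set contains all previous inputs.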

Figures (6)

  • Figure 1: For $i = 1, 100, 400, 900, 1600$ (from left to right), positive (orange) and negative (blue) conditional correlations with the $i$th input $\mathbf{x}_i$ in the maximin ordering (red points) conditional on all (top) and $m=26$ nearest (bottom) previously ordered inputs (green points), for a GP with Matérn covariance (range $r = 0.1$ and smoothness $\nu = 2$) on a grid of size $n= 40 \times 40 = 1600$. The conditional correlations given only nearest neighbors are almost identical to corresponding conditional correlations given all past observations, which implies that the two likelihoods corresponding to the conditional correlation maps are also similar to each other. This figure is inspired by Figure 5 in Schafer2020.
  • Figure 2: Comparison of predictive performance of taper, reduced-rank, and Vecchia GP approximations at $\mathbf{x}_{n+1} = (0.5, 0.5) \in [0,1]^2$ for 400 synthetic datasets. The top-left panel compares the average number of non-zero entries of the covariance matrix per observation for the taper approximation, the number of inducing points for the reduced-rank approximation, and the average size of the conditioning sets for the Vecchia approximation. The top-right and bottom-left panels present the log-scale mean squared prediction error (MSPE) and the variance of the predictive distributions, respectively. The bottom-right panel compares KL divergences between $p(y_{n+1} | \mathbf{y}_{1:n})$ and $\hat{p}(y_{n+1} | \mathbf{y}_{1:n})$ based on the different approximations. All parameters were assumed to be known, and the $x$-axes of the panels are on a log scale.
  • Figure 3: Comparison of maximum-approximate-likelihood estimators based on taper, reduced-rank, and Vecchia approximations. The left panel compares the average number of non-zero entries of the covariance matrix per observation for the taper approximation, the number of inducing points for the reduced-rank approximation, and the average size of the conditioning sets for the Vecchia approximation. The center (right) panel shows the RMSE for estimating the variance (range) parameter based on exact and approximate GP likelihoods for 200 synthetic datasets. Vecchia and exact-GP RMSEs were nearly identical. When one parameter was estimated, the other was assumed to be known. Data were generated on a regular grid of $n$ inputs on the unit square from a GP with Matérn covariance with variance $1$, range $0.1$, and smoothness $0.5$. The $x$-axes of the panels are on a log scale.
  • Figure S1: For $i = 1, 100, 400$ (from left to right), positive (orange) and negative (blue) conditional correlations with the $i$th input $\mathbf{x}_i$ in the maximin ordering (red points) conditional on all (top) and $m=17$ nearest (bottom) previously ordered inputs (green points), for a GP with Matérn covariance (range $r = 0.1$ and smoothness $\nu = 2$) on a grid of size $n = 20 \times 20 = 400$. This figure is inspired by Figure 5 in Schafer2020.
  • Figure S2: For $i = 1, 100, 400, 900, 1600, 2500, 3600$ (from left to right), positive (orange) and negative (blue) conditional correlations with the $i$th input $\mathbf{x}_i$ in the maximin ordering (red points) conditional on all (top) and $m=32$ nearest (bottom) previously ordered inputs (green points), for a GP with Matérn covariance (range $r = 0.1$ and smoothness $\nu = 2$) on a grid of size $n = 60 \times 60 = 3600$. This figure is inspired by Figure 5 in Schafer2020.
  • ...and 1 more figures
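
The maximin ordering referred to in the figure captions can be sketched as a greedy farthest-point traversal: each newly ordered input maximizes its minimum distance to the inputs ordered before it. The code below is a simple quadratic-time illustration on the $40 \times 40$ grid of Figure 1. The choice of the first input (index 0 here) and the function name `maximin_order` are assumptions made for this example, not details taken from the paper.

```python
import numpy as np

def maximin_order(x, first=0):
    """Greedy maximin ordering of the rows of x (O(n^2) time).

    Returns the ordering and, for each position, the distance from that
    input to the nearest previously ordered input (inf for the first).
    """
    n = len(x)
    order = [first]
    lengths = [np.inf]
    # min_dist[j]: distance from x[j] to the closest already-ordered input
    min_dist = np.linalg.norm(x - x[first], axis=1)
    min_dist[first] = -np.inf            # exclude inputs already ordered
    for _ in range(n - 1):
        nxt = int(np.argmax(min_dist))   # farthest from everything ordered so far
        order.append(nxt)
        lengths.append(min_dist[nxt])
        min_dist = np.minimum(min_dist, np.linalg.norm(x - x[nxt], axis=1))
        min_dist[nxt] = -np.inf
    return np.array(order), np.array(lengths)

side = np.linspace(0.0, 1.0, 40)
grid = np.stack(np.meshgrid(side, side), axis=-1).reshape(-1, 2)   # n = 1600 inputs
order, lengths = maximin_order(grid)
print(order[:5], lengths[1:5])
```

The distances in `lengths` are non-increasing along the ordering, so early inputs are spread coarsely over the domain and later inputs fill in finer scales. This is why a fixed number of nearest previously ordered inputs covers a shrinking neighborhood as $i$ grows.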

Theorems & Definitions (11)

  • Proposition 1: Convergence in KL divergence of Vecchia approximation
  • Proposition 2: Asymptotic equivalence of Vecchia GP prediction
  • Proposition 3: Asymptotic equivalence of MVL estimation
  • Proposition 4: Unbiased Vecchia-estimating function
  • Proof of Proposition 1
  • Lemma 1
  • Proof of Lemma 1
  • Lemma 2
  • Proof of Lemma 2
  • Proof of Proposition 3
  • ...and 1 more