Numerically Stable Sparse Gaussian Processes via Minimum Separation using Cover Trees

Alexander Terenin; David R. Burt; Artem Artemev; Seth Flaxman; Mark van der Wilk; Carl Edward Rasmussen; Hong Ge

Numerically Stable Sparse Gaussian Processes via Minimum Separation using Cover Trees

Alexander Terenin, David R. Burt, Artem Artemev, Seth Flaxman, Mark van der Wilk, Carl Edward Rasmussen, Hong Ge

TL;DR

Building on stability theory originally developed in the interpolation literature, sufficient and in certain cases necessary conditions on the inducing points are derived for the computations performed to be numerically stable and an automated method for computing inducing points satisfying these conditions is proposed.

Abstract

Gaussian processes are frequently deployed as part of larger machine learning and decision-making systems, for instance in geospatial modeling, Bayesian optimization, or in latent Gaussian models. Within a system, the Gaussian process model needs to perform in a stable and reliable manner to ensure it interacts correctly with other parts of the system. In this work, we study the numerical stability of scalable sparse approximations based on inducing points. To do so, we first review numerical stability, and illustrate typical situations in which Gaussian process models can be unstable. Building on stability theory originally developed in the interpolation literature, we derive sufficient and in certain cases necessary conditions on the inducing points for the computations performed to be numerically stable. For low-dimensional tasks such as geospatial modeling, we propose an automated method for computing inducing points satisfying these conditions. This is done via a modification of the cover tree data structure, which is of independent interest. We additionally propose an alternative sparse approximation for regression with a Gaussian likelihood which trades off a small amount of performance to further improve stability. We provide illustrative examples showing the relationship between stability of calculations and predictive performance of inducing point methods on spatial tasks.

Numerically Stable Sparse Gaussian Processes via Minimum Separation using Cover Trees

TL;DR

Abstract

Paper Structure (28 sections, 25 theorems, 112 equations, 8 figures, 1 table, 1 algorithm)

This paper contains 28 sections, 25 theorems, 112 equations, 8 figures, 1 table, 1 algorithm.

Introduction
Gaussian Processes
Inducing Points
Gaussian Processes for Geospatial Modelling
Numerical Stability
Instability in Gaussian Process Models
Numerical Stability in Scalable Gaussian Process Approximations
Numerical Stability in Sparse Approximations via Minimum Separation
Inducing Points for Geospatial Data via Cover Trees
Additional Stability via the Clustered-data Inducing Point Approximation
Experiments
Spatial Resolution and Empirical Approximation Error
Comparison to Alternative Inducing Point Selection Schemes
Reliability in Floating Point Precision: Geospatial Illustrative Example
Numerical Stability and Dataset Size
...and 13 more sections

Key Result

Proposition 0

Let $k$ be a continuous stationary kernel, and let the entries in $\v{x}$ be independently sampled from a uniform distribution on some compact subset $X \subseteq \R^d$. Define the covariance operator acting on the Hilbert space of (equivalence classes of) square-integrable functions. Then the eigenvalues of $\c{K}$ are countable, non-negative, admit a maximum, and can be ordered to form a non-in

Figures (8)

Figure 1: Here we illustrate the condition numbers and resulting linear system error for the Kac--Murdock--Szegö matrix, which is the kernel matrix of an exponential kernel on a regularly-spaced one-dimensional grid. We vary the spacing $r$ and respective correlation $\rho = \exp(-r)$ of neighboring points, and the size $n$ of the matrix, with $\rho = 0.999$ used in cases where $n$ varies, and $n = 256$ used in cases where $\rho$ varies. We plot the condition number, computed numerically using an eigenvalue factorization, along with its theoretical lower and upper bounds. Then, we generate random vectors $\v{v}\~[N](\v{0},\m{I})$, compute $\v{u} = \m{K}_{\v{x}\v{x}} \v{v}$, solve for $\v{v} = \m{K}_{\v{x}\v{x}}^{-1} \v{u}$ numerically, and plot the median error norm over $10^4$ samples, along with 25% and 75% quantiles. We observe that condition numbers asymptotically grow as $\rho\to1$, but not as $N\to\infty$, and increase hand-in-hand with numerical error.
Figure 2: Two iterations of \ref{['alg:covertree']} on a tree with $L=3$ total levels. Given a region, the algorithm first (a) computes a covering of the region by picking data points one-by-one from those not yet covered. Then, it (b) computes the $R$-neighbors of each node and uses this to efficiently compute a Voronoi tessellation of the region. Finally, the algorithm (c) repeats the process recursively.
Figure 3: The idea behind the clustered-data approximation of \ref{['prop:inducing-point-repr']}. To construct this approximation, we move each data point in $\v{x}$ to its nearest cluster, transforming (a) to (b). We can then reinterpret and represent (b) in a sparse manner by merging repeated data points, replacing $\v{y}$ with the cluster means $\v{u}$, and re-weighting each data point according to cluster size---obtaining (c), which is equal in distribution to (b).
Figure 4: We illustrate how spatial resolution affects approximation accuracy and computational cost. We see that increasing the spatial resolution leads to using fewer inducing points, dropping rapidly in low dimension, and slower in higher dimensions. On the other hand, the error in approximation, as measured by Wasserstein distance between the approximate posterior and exact posterior, increases as the spatial resolution increases. Computational costs follow the opposite relationship: finer spatial resolutions are less stable and more expensive to compute, as evidenced by condition numbers and the number of iterations needed for conjugate gradients to converge.
Figure 5: We compare how different inducing point selection algorithms vary according to performance and stability, using the geospatial East Africa dataset, and higher-dimensional Power dataset. The number of inducing points produced by algorithms whose output is variable is shown rounded to the nearest increment. Note that the computational complexity of these methods differs: cover tree is $\c{O}(N\log N)$, online and the $K$-means variants are $\c{O}(NM)$, partial pivoted Cholesky is $\c{O}(NM^2)$, and optimizing the inducing points is $\c{O}(NM^2)$ per optimization iteration. We see that using more inducing points results in higher performance, but reduced stability, with the precise effect varying according to dataset, dimensionality, and inducing point selection method.
...and 3 more figures

Theorems & Definitions (35)

Proposition 0
Definition 1
Proposition 2
Corollary 3
Corollary 4
Proposition 4
Definition 5
Theorem 6
Corollary 7
Proposition 7
...and 25 more

Numerically Stable Sparse Gaussian Processes via Minimum Separation using Cover Trees

TL;DR

Abstract

Numerically Stable Sparse Gaussian Processes via Minimum Separation using Cover Trees

Authors

TL;DR

Abstract

Table of Contents

Key Result

Figures (8)

Theorems & Definitions (35)