A Unifying Perspective on Non-Stationary Kernels for Deeper Gaussian Processes

Marcus M. Noack, Hengrui Luo, Mark D. Risser

TL;DR

This work shows a variety of kernels in action using representative datasets, carefully studies their properties, and proposes a new kernel that combines some of the identified advantages of existing kernels.

Abstract

The Gaussian process (GP) is a popular statistical technique for stochastic function approximation and uncertainty quantification from data. GPs have been adopted into the realm of machine learning in the last two decades because of their superior prediction abilities, especially in data-sparse scenarios, and their inherent ability to provide robust uncertainty estimates. Even so, their performance depends heavily on intricate customizations of the core methodology, which often leads to dissatisfaction among practitioners when standard setups and off-the-shelf software tools are deployed. Arguably the most important building block of a GP is the kernel function, which assumes the role of a covariance operator. Stationary kernels of the Matérn class are used in the vast majority of applied studies; poor prediction performance and unrealistic uncertainty quantification are often the consequences. Non-stationary kernels show improved performance but are rarely used due to their more complicated functional form and the associated effort and expertise needed to define and tune them optimally. In this perspective, we want to help ML practitioners make sense of some of the most common forms of non-stationarity for Gaussian processes. We show a variety of kernels in action using representative datasets, carefully study their properties, and compare their performances. Based on our findings, we propose a new kernel that combines some of the identified advantages of existing kernels.

Paper Structure (34 sections, 32 equations, 14 figures, 2 tables, 3 algorithms)


Figures (14)

  • Figure 1: The key concept and essence of non-stationary kernels. A synthetic function, which is also used later in our computational experiments, was sampled at 40 equidistant points. The function is composed of high-frequency regions (far left and right) and near-constant-gradient regions (center, green circle). A Gaussian process (GP) is tasked with interpolating the data using a stationary (top, a) and a non-stationary (bottom, b) kernel. For each case, the function approximation and the prior covariance matrix are presented. While the posterior mean is similar in both cases, the posterior variance differs substantially. With the stationary kernel, the uncertainty in the central region increases between data points, even though the function is very well-behaved there. The covariance matrix offers clues as to why: it is constant along diagonals, which translates into uncertainties that depend only on the distance from surrounding data points, independent of where in the domain they are located. The non-stationary kernel has no such restriction and provides more realistic estimates of the uncertainty. Its covariance entries are not constant along diagonals but correspond to different regions of the function (blue line connections).
  • Figure 2: Our way of measuring the non-stationarity of a dataset or a synthetic function. A set of data points is drawn randomly from within the input domain, and a GP with a stationary kernel is trained via log marginal likelihood maximization (MLE) after each draw. The resulting distribution of the hyperparameters (here a signal variance and an isotropic length scale) can be used to measure non-stationarity. This distribution can be visualized via scatter (middle) and violin (right) plots. We use the variance of these distributions as a numerical measure of non-stationarity. The stationary function (top) leads to a very narrow distribution of the length scale and the signal variance. For the non-stationary functions (second and third rows), the distributions of both hyperparameters are broader and the associated variances are larger. In this figure, the axis scales are kept constant to allow easy comparison of the stationarity properties of the test functions.
  • Figure 3: Test dataset 1 (a) and its non-stationarity measures visualized as distributions (b, c, d) in the hyperparameters of a stationary kernel trained on local subsets of that data. The dataset is derived from a one-dimensional synthetic function (see Equation \ref{eq:synthfunc}). Non-stationarity appears to be clearly present in the length scale and the signal variance.
  • Figure 4: Test dataset 2 (a, b) and its non-stationarity measures visualized as distributions in the hyperparameters (c, d, e) of a stationary kernel trained on local subsets of that data. The dataset consists of temperatures recorded across the United States over a period of time. Weak non-stationarity appears to be present in the length scale and the signal variance.
  • Figure 5: Test dataset 3 (a) and its non-stationarity measures visualized as distributions in the hyperparameters (b, c, d) of a stationary kernel trained on local subsets of that data. The dataset consists of analyzed X-ray scattering signals over $[0,1]^3$. Non-stationarity is apparent in both length scale and signal variance.
  • ...and 9 more figures
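
The Figure 1 caption observes that a stationary kernel evaluated on equidistant points gives a prior covariance matrix that is constant along its diagonals, while a non-stationary kernel does not. The following is a minimal sketch of that observation, not the kernels used in the paper. The squared-exponential kernel, the amplitude function `amplitude`, the length scale of 0.1, and all function names are assumptions chosen for illustration. The non-stationary kernel here simply rescales the stationary one by an input-dependent amplitude at each argument, which keeps it a valid covariance function.

```python
import math

def stationary_kernel(x1, x2, length_scale=0.1, signal_var=1.0):
    """Squared-exponential kernel; depends on the inputs only through |x1 - x2|."""
    r = abs(x1 - x2)
    return signal_var * math.exp(-0.5 * (r / length_scale) ** 2)

def amplitude(x):
    """Hypothetical input-dependent amplitude: smallest at the domain center."""
    return 1.0 + 4.0 * (x - 0.5) ** 2

def nonstationary_kernel(x1, x2, length_scale=0.1):
    """Stationary kernel rescaled by amplitude(x1) * amplitude(x2)."""
    return amplitude(x1) * amplitude(x2) * stationary_kernel(x1, x2, length_scale)

def covariance_matrix(kernel, points):
    """Prior covariance matrix K[i][j] = kernel(points[i], points[j])."""
    return [[kernel(a, b) for b in points] for a in points]

def max_diagonal_spread(K):
    """Largest (max - min) over the entries of any single diagonal of K.

    K is symmetric, so checking the main and upper diagonals is sufficient.
    """
    n = len(K)
    spread = 0.0
    for offset in range(n):
        diagonal = [K[i][i + offset] for i in range(n - offset)]
        spread = max(spread, max(diagonal) - min(diagonal))
    return spread

# 40 equidistant points on [0, 1], matching the sampling described for Figure 1.
points = [i / 39 for i in range(40)]

K_stat = covariance_matrix(stationary_kernel, points)
K_nonstat = covariance_matrix(nonstationary_kernel, points)

print("stationary spread:    ", max_diagonal_spread(K_stat))
print("non-stationary spread:", max_diagonal_spread(K_nonstat))
```

For the stationary kernel the spread along every diagonal is zero up to floating-point rounding, so the prior covariance between two points depends only on their separation. For the rescaled kernel the diagonals vary with position, which is what lets the prior variance differ between regions of the domain.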