Integrated Variational Fourier Features for Fast Spatial Modelling with Gaussian Processes

Talay M Cheema, Carl Edward Rasmussen

TL;DR

Integrated Variational Fourier Features (IFF) deliver scalable Gaussian process inference by averaging Fourier features over disjoint intervals, making the inducing feature cross‑covariances hyperparameter‑independent and precomputable. This yields an O(M^3) per‑iteration cost while supporting a broad class of stationary kernels, enabling fast learning and prediction for low‑dimensional spatial data. The authors establish convergence guarantees for the approximate objective and provide practical guidance on choosing M, z, and ε, supported by synthetic and real‑world experiments that show substantial speedups with competitive predictive performance. While effective in low dimensions, IFF faces exponential scaling with dimension and is limited to stationary priors, pointing to future work on non‑stationary extensions and higher‑dimensional efficiency improvements.

Abstract

Sparse variational approximations are popular methods for scaling up inference and learning in Gaussian processes to larger datasets. For $N$ training points, exact inference has $O(N^3)$ cost; with $M \ll N$ features, state-of-the-art sparse variational methods have $O(NM^2)$ cost. Recently, methods have been proposed using more sophisticated features; these promise $O(M^3)$ cost, with good performance in low-dimensional tasks such as spatial modelling, but they only work with a very limited class of kernels, excluding some of the most commonly used. In this work, we propose integrated Fourier features, which extend these performance benefits to a very broad class of stationary covariance functions. We motivate the method and choice of parameters from a convergence analysis and empirical exploration, and show practical speedup in synthetic and real-world spatial regression tasks.
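
To make the cost argument above concrete, the following is a minimal sketch, not the authors' construction. It swaps in a plain Riemann-sum approximation of a stationary kernel's spectral representation on a frequency grid with spacing $\varepsilon$, rather than the interval-averaged features and variational objective of the paper. It only illustrates why features that do not depend on the kernel hyperparameters allow an $M \times M$ matrix to be precomputed once, after which each hyperparameter evaluation costs $O(M^3)$. All names (`se_spectral_density`, `Phi`, `PhiH_Phi`, `logdet_small`, `logdet_large`) and all numerical settings are assumptions made for illustration.

```python
import numpy as np

def se_spectral_density(xi, lengthscale):
    """Spectral density of k(tau) = exp(-tau^2 / (2 l^2)),
    using the convention k(tau) = integral of s(xi) exp(i 2 pi xi tau) d xi."""
    return np.sqrt(2.0 * np.pi) * lengthscale * np.exp(-2.0 * (np.pi * lengthscale * xi) ** 2)

rng = np.random.default_rng(0)
N, M, eps = 200, 61, 0.05               # illustrative sizes, not values from the paper
noise_var = 0.1                         # illustrative noise variance
x = rng.uniform(-3.0, 3.0, size=N)      # 1D inputs
z = eps * (np.arange(M) - (M - 1) / 2)  # symmetric frequency grid with spacing eps

# Complex exponential features: they depend on the inputs and the grid only,
# not on the lengthscale, so Phi^H Phi is computed once at O(N M^2) cost.
Phi = np.exp(2j * np.pi * np.outer(x, z))
PhiH_Phi = Phi.conj().T @ Phi

def logdet_small(lengthscale):
    """log det(I_M + W^(1/2) Phi^H Phi W^(1/2) / noise_var): O(M^3) per call."""
    root_w = np.sqrt(eps * se_spectral_density(z, lengthscale))
    A = np.eye(M) + (root_w[:, None] * PhiH_Phi * root_w[None, :]) / noise_var
    return np.linalg.slogdet(A)[1]

def logdet_large(lengthscale):
    """Same quantity via the N x N matrix (O(N^3)), kept only as a check."""
    w = eps * se_spectral_density(z, lengthscale)
    K_approx = ((Phi * w) @ Phi.conj().T).real
    return np.linalg.slogdet(np.eye(N) + K_approx / noise_var)[1]

def exact_kernel(lengthscale):
    tau = x[:, None] - x[None, :]
    return np.exp(-0.5 * (tau / lengthscale) ** 2)

# How well the Riemann sum reproduces the exact kernel at unit lengthscale.
w = eps * se_spectral_density(z, 1.0)
kernel_err = np.max(np.abs(((Phi * w) @ Phi.conj().T).real - exact_kernel(1.0)))
# Agreement of the M x M and N x N log-determinants (Sylvester's identity).
logdet_gap = abs(logdet_small(1.0) - logdet_large(1.0))
print(kernel_err, logdet_gap)
```

In this sketch, changing the lengthscale only changes the diagonal weights applied to `PhiH_Phi`, so the data are touched once. The agreement between the two log-determinants is an exact identity for the approximated kernel; how close that kernel is to the true one depends on the grid spacing, the number of frequencies, and the width of the input domain, which is the kind of trade-off the paper's convergence analysis addresses.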

Paper Structure (35 sections, 7 theorems, 64 equations, 12 figures, 1 table)

Key Result

Lemma 4.1

Under assumptions A3 and A4,

Figures (12)

  • Figure 1: Illustration of the Integrated Fourier Feature construction. We plot the mean function (dashed), between one and three standard deviations (shaded) and sample functions in both the data and frequency domains for a squared exponential kernel with unit lengthscale. The sample functions in the data and frequency domains correspond to one another. (a) The prior's Fourier transform is white Gaussian noise whose variance is given by the spectral density. (b) We cannot condition meaningfully on some finite collection of frequencies (red stars), as this gives no information about the other frequencies: the conditional prior $p(f|u)$ in the data domain is unchanged. (c) We show only the inducing values in the frequency domain, which are averages of the surrounding region. The conditional prior is now meaningful, and the residual uncertainty is due to high frequency content not included in the features.
  • Figure 2: Gap between the log marginal likelihood and the training objective ($\mathcal{L}-\mathfrak{F}$) for different settings of $M$ and $\varepsilon$, for data sampled from a GP with a Gaussian (left) or Matérn-3/2 (right) kernel. In each case the hyperparameters are set to their groundtruth values, where the lengthscale is $\lambda$. The inputs are samples from a uniform distribution centred on 0 and with width $W_x$. The horizontal line is at $0.95$.
  • Figure 3: Comparing standard sparse Gaussian process regression (black) to IFF (red) for data generated from a prior with Gaussian covariance function in 1D (left) and 2D (right). Lower and to the left is better. The groundtruth $\mathcal{L}$ is $\mathcal{L}$ evaluated at the groundtruth hyperparameters, whereas in the other rows, $\mathcal{L}$ is evaluated at the learnt hyperparameters. The gaps are normalised by $N$, and execution time is normalised by the longest. The bottom row shows feature efficiency, whereas the upper rows show computational efficiency.
  • Figure 4: As Figure 3, but with the data sampled from a Matérn-5/2 GP. The picture is broadly comparable, but VFF now more closely matches the prior, so the drop in feature efficiency is far less in higher dimensions.
  • Figure 5: Performance curves for real world datasets of increasing size (the top row is the smallest). Lower and to the left is better.
  • ...and 7 more figures

Theorems & Definitions (15)

  • Lemma 4.1
  • Theorem 4.2
  • proof
  • Remark 4.3
  • Theorem 4.4
  • proof
  • Lemma B.1
  • proof
  • Lemma C.1
  • proof
  • ...and 5 more