Mixtures of Gaussian Process Experts with SMC$^2$

Teemu Härkönen, Sara Wade, Kody Law, Lassi Roininen

TL;DR

This paper tackles the cubic computational complexity of Gaussian processes (GPs) with a discriminative mixture of experts (MoE) in which a gating network partitions the data among $K$ local GP experts. It introduces a nested sequential Monte Carlo scheme, SMC$^2$, that jointly infers the gating network parameters $\Psi$ and the GP parameters $\Theta$ using tempered likelihoods: an inner SMC sampler marginalizes the GP parameters for each partition, while an outer SMC sampler, with particle marginal Metropolis-Hastings (PMMH) steps, samples the partition and the gating network parameters. The approach gives fully Bayesian inference for non-stationary, heteroscedastic, or discontinuous data and remains parallelizable. On synthetic and real datasets it improves predictive performance and uncertainty quantification over importance sampling, and posterior similarity matrices give insight into partition uncertainty. Potential extensions include alternative gating networks and online or inverse problems.

Abstract

Gaussian processes are a key component of many flexible statistical and machine learning models. However, they exhibit cubic computational complexity and high memory requirements because a full covariance matrix must be inverted and stored. To circumvent this, mixtures of Gaussian process experts have been considered, where data points are assigned to independent experts, reducing the complexity by allowing inference based on smaller, local covariance matrices. Moreover, mixtures of Gaussian process experts substantially enrich the model's flexibility, allowing for behaviors such as non-stationarity, heteroscedasticity, and discontinuities. In this work, we construct a novel inference approach based on nested sequential Monte Carlo samplers to simultaneously infer both the gating network and Gaussian process expert parameters. This greatly improves inference compared to importance sampling, particularly in settings where a stationary Gaussian process is inappropriate, while still being thoroughly parallelizable.
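
The complexity argument in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the squared-exponential kernel, the fixed hyperparameters, the synthetic data, and the hard two-way partition `labels` are all assumptions chosen for illustration. It shows that, for a fixed partition, the mixture's log marginal likelihood is a sum over independent experts, each needing only a Cholesky factorization of its own local covariance matrix.

```python
import numpy as np

def sq_exp_kernel(x1, x2, sigma_f=1.0, length=1.0):
    """Squared-exponential covariance between two 1-D input arrays."""
    d = x1[:, None] - x2[None, :]
    return sigma_f**2 * np.exp(-0.5 * (d / length) ** 2)

def gp_log_marginal(x, y, sigma_eps=0.1, sigma_f=1.0, length=1.0):
    """Log marginal likelihood of a zero-mean GP; the Cholesky step is O(n^3)."""
    n = x.size
    K = sq_exp_kernel(x, x, sigma_f, length) + sigma_eps**2 * np.eye(n)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L, y)  # L^{-1} y, so alpha @ alpha = y^T K^{-1} y
    return (-0.5 * alpha @ alpha
            - np.log(np.diag(L)).sum()  # equals -0.5 * log det K
            - 0.5 * n * np.log(2 * np.pi))

def moe_log_marginal(x, y, labels, **hyper):
    """Sum of independent expert log marginals for a fixed partition `labels`."""
    return sum(gp_log_marginal(x[labels == k], y[labels == k], **hyper)
               for k in np.unique(labels))

# Hypothetical data with a discontinuity at x = 5 (illustration only).
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 10.0, size=200))
y = np.where(x < 5.0, np.sin(x), 3.0 + 0.1 * x) + 0.1 * rng.standard_normal(x.size)
labels = (x >= 5.0).astype(int)  # assumed hard partition into two experts

full = gp_log_marginal(x, y)            # one n x n covariance matrix
local = moe_log_marginal(x, y, labels)  # two smaller covariance matrices
print(f"single GP: {full:.1f}, two local experts: {local:.1f}")
```

With $K$ experts of roughly equal size, each factorization costs on the order of $(n/K)^3$ instead of $n^3$, and the experts can be evaluated in parallel. In the paper the partition and the hyperparameters are inferred rather than fixed as they are here.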

Paper Structure (11 sections, 45 equations, 15 figures, 3 tables, 3 algorithms)

Figures (15)

  • Figure 1: An example of partitioned data with four clusters. The clustered data points are shown in blue, red, green, and gray for each cluster with their respective GP mean and 95% pointwise credible intervals in the corresponding color.
  • Figure 2: Plate diagram showing relations between the data and model parameters for MoE.
  • Figure 3: On top left, the posterior distribution $p(\Theta \mid Y,X)$ in Eq. \ref{eq:partitionLikelihood} as a function of the noise standard deviation $\sigma_{1,\varepsilon}$ and length scale $l_{1,1}$ for 7 data points from a single zero-mean Gaussian process expert with $\left( \sigma_{1,\varepsilon}, \sigma_{1,f}, l_{1,1} \right) = \left( 0.1^{1/2}, 1, 1 \right)$. The local maxima of the posterior distribution are illustrated with blue and red stars. On top right, the 7 data points in blue together with the true predictive mean and 95% interval in black and gray. Below, from left to right, the predictive mean and 95% intervals corresponding to the short length scale local maximum, the long length scale local maximum, and marginalization over the posterior distribution $p(\Theta \mid Y, X)$.
  • Figure 4: The marginalized likelihood estimates $p_\mathrm{MAP}( Y \mid X)$, $p_\mathrm{LM}( Y \mid X)$, and $p( Y \mid X)$ using the MAP estimate (blue line), the long length scale local maximum (red line), and high-accuracy numerical integration (dashed red lines), respectively. The MAP estimate leads to overfitting and high likelihood for the training data. The mean $p_\mathrm{SMC}( Y \mid X)$ and 95% confidence intervals given by repeated SMC runs with different numbers of particles $M$, shown in black and gray, respectively, match closely with numerical integration.
  • Figure 5: An illustration of the inner SMC sampler ${\Upsilon}( \Theta_{1:M}^{(0:t)}, A^{(t)} \mid C, X, Y)$. The SMC sampler uses $M$ particles $\Theta_{1:M}^{(0:t)}$ to construct an approximation for the posterior distribution of the GP expert parameters at each iteration corresponding to the current level of tempering $\kappa^{(t)}$. The inner SMC sampler provides an unbiased estimator for the marginal likelihood $p_t( Y \mid X, C)$ in Eq. \ref{eq:temperedMarginalLikelihood} according to Eq. \ref{eq:Z_tM}. The auxiliary parameters $A^{(t)}$ appear in the mutation and selection steps and are not used elsewhere or retained.
  • ...and 10 more figures
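
The tempered SMC idea described in the Figure 5 caption can be sketched on a toy problem. This is an illustration under stated assumptions, not the paper's inner sampler: the model is a scalar mean $\theta \sim \mathcal{N}(0, 1)$ with Gaussian observations rather than GP hyperparameters, the tempering schedule is linear, resampling is multinomial at every step, and the mutation is a single random-walk Metropolis-Hastings move. The function names, the data `y`, and all tuning values are hypothetical. The toy model is conjugate, so the estimate can be compared with the exact log evidence.

```python
import numpy as np

def log_lik(theta, y, sigma=0.5):
    """Gaussian log-likelihood of the data y for each particle in theta."""
    r = y[None, :] - theta[:, None]
    return (-0.5 * (r**2).sum(axis=1) / sigma**2
            - y.size * np.log(sigma * np.sqrt(2 * np.pi)))

def log_prior(theta):
    """Standard normal log-density."""
    return -0.5 * theta**2 - 0.5 * np.log(2 * np.pi)

def tempered_smc_log_evidence(y, M=2000, T=20, step=0.5, seed=0):
    """Estimate log p(y) with M particles and T tempering steps."""
    rng = np.random.default_rng(seed)
    kappas = np.linspace(0.0, 1.0, T + 1)  # assumed linear tempering schedule
    theta = rng.standard_normal(M)         # particles drawn from the prior
    log_z = 0.0
    for k_prev, k in zip(kappas[:-1], kappas[1:]):
        # Incremental weights: likelihood raised to the temperature increment.
        log_w = (k - k_prev) * log_lik(theta, y)
        m = log_w.max()
        w = np.exp(log_w - m)
        log_z += m + np.log(w.mean())
        # Selection: multinomial resampling proportional to the weights.
        theta = theta[rng.choice(M, size=M, p=w / w.sum())]
        # Mutation: one random-walk MH step targeting prior * likelihood^k.
        prop = theta + step * rng.standard_normal(M)
        log_acc = (log_prior(prop) + k * log_lik(prop, y)
                   - log_prior(theta) - k * log_lik(theta, y))
        accept = np.log(rng.uniform(size=M)) < log_acc
        theta = np.where(accept, prop, theta)
    return log_z

def exact_log_evidence(y, sigma=0.5):
    """Closed form: y ~ N(0, sigma^2 I + 1 1^T) under the toy model."""
    n = y.size
    S = sigma**2 * np.eye(n) + np.ones((n, n))
    _, logdet = np.linalg.slogdet(S)
    return (-0.5 * y @ np.linalg.solve(S, y) - 0.5 * logdet
            - 0.5 * n * np.log(2 * np.pi))

y = np.array([0.8, 1.1, 0.6, 1.3, 0.9, 1.0, 0.7])  # hypothetical data
est = tempered_smc_log_evidence(y)
exact = exact_log_evidence(y)
print(f"SMC estimate: {est:.3f}, exact: {exact:.3f}")
```

Because the particles are resampled at every step, the evidence estimate is the product of the mean incremental weights, accumulated here in log space. The auxiliary variables $A^{(t)}$ from the caption correspond loosely to the resampling indices and proposal noise, which this sketch draws and discards inside the loop.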