Mixtures of Gaussian Process Experts with SMC$^2$
Teemu Härkönen, Sara Wade, Kody Law, Lassi Roininen
TL;DR
This paper tackles the cubic time complexity of Gaussian processes (GPs) by modeling data with a discriminative mixture of experts (MoE), in which a gating network partitions the data among $K$ local GP experts. It introduces a nested SMC$^2$ inference scheme that jointly infers the gating network parameters $\Psi$ and the GP parameters $\Theta$ using tempered likelihoods: an inner SMC marginalizes the GP parameters for each partition, while an outer SMC (with particle marginal Metropolis-Hastings, PMMH, steps) samples the partition and the gating network parameters. The approach yields robust, fully Bayesian inference for non-stationary, heteroscedastic, or discontinuous data and remains parallelizable. On synthetic and real datasets it demonstrates improved predictive performance and uncertainty quantification over importance sampling, and posterior similarity matrices give insight into partition uncertainty. The method offers a principled, scalable framework for flexible MoEs with GP experts, with potential extensions to alternative gating networks and to online or inverse problems.
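To make the likelihood-tempering idea mentioned above concrete, here is a minimal sketch of a generic tempered SMC sampler on a toy problem. It is not the paper's SMC$^2$ algorithm: there is no inner sampler, no partition, and no gating network. The toy model (a Gaussian prior on a scalar mean with Gaussian observations), the linear temperature schedule, and all function names and parameter values are assumptions chosen for illustration only.

```python
import math
import random

PRIOR_SD = 3.0  # assumed prior standard deviation for the toy model


def log_prior(theta):
    # Gaussian prior N(0, PRIOR_SD^2), up to an additive constant
    return -0.5 * (theta / PRIOR_SD) ** 2


def log_lik(theta, data, noise_sd=1.0):
    # Gaussian likelihood for i.i.d. observations, up to an additive constant
    return -0.5 * sum(((y - theta) / noise_sd) ** 2 for y in data)


def tempered_smc(data, n_particles=1000, n_temps=10, step=0.5, seed=0):
    rng = random.Random(seed)
    # Temperature 0: particles are draws from the prior
    particles = [rng.gauss(0.0, PRIOR_SD) for _ in range(n_particles)]
    betas = [t / n_temps for t in range(n_temps + 1)]  # assumed linear schedule
    for b_prev, b in zip(betas[:-1], betas[1:]):
        # 1) Reweight by the likelihood raised to the temperature increment
        logw = [(b - b_prev) * log_lik(th, data) for th in particles]
        m = max(logw)
        weights = [math.exp(lw - m) for lw in logw]
        # 2) Multinomial resampling proportional to the weights
        particles = rng.choices(particles, weights=weights, k=n_particles)
        # 3) One random-walk Metropolis move targeting prior * likelihood^b
        moved = []
        for th in particles:
            prop = th + rng.gauss(0.0, step)
            log_acc = (log_prior(prop) + b * log_lik(prop, data)
                       - log_prior(th) - b * log_lik(th, data))
            accept = rng.random() < math.exp(min(0.0, log_acc))
            moved.append(prop if accept else th)
        particles = moved
    return particles  # approximate posterior samples at temperature 1


data = [1.2, 0.8, 1.5, 0.9, 1.1]
particles = tempered_smc(data)
post_mean = sum(particles) / len(particles)
# Closed-form posterior mean for this conjugate toy model (unit noise variance)
exact_mean = sum(data) / (len(data) + 1.0 / PRIOR_SD ** 2)
```

Because the toy model is conjugate, `post_mean` can be compared against `exact_mean` as a sanity check. In a nested scheme of the kind summarized above, the likelihood in the reweighting step would itself be estimated by an inner sampler rather than evaluated in closed form.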
Abstract
Gaussian processes are a key component of many flexible statistical and machine learning models. However, they exhibit cubic computational complexity and high memory requirements because a full covariance matrix must be inverted and stored. To circumvent this, mixtures of Gaussian process experts have been proposed, in which data points are assigned to independent experts, reducing the complexity by allowing inference based on smaller, local covariance matrices. Moreover, mixtures of Gaussian process experts substantially enrich the model's flexibility, allowing for behaviors such as non-stationarity, heteroscedasticity, and discontinuities. In this work, we construct a novel inference approach based on nested sequential Monte Carlo samplers to simultaneously infer both the gating network and Gaussian process expert parameters. This greatly improves inference compared to importance sampling, particularly in settings where a stationary Gaussian process is inappropriate, while still being thoroughly parallelizable.
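The role of the smaller, local covariance matrices can be illustrated with a short sketch. Assuming a fixed hard assignment of points to experts (in the paper the assignment is inferred through the gating network), the mixture log-likelihood is a sum of independent GP log marginal likelihoods, each requiring a factorization of only that expert's covariance block. The zero-mean GP, the squared-exponential kernel, the hyperparameter values, and all function names below are assumptions for illustration, not the paper's implementation.

```python
import math


def sq_exp_kernel(x1, x2, signal_var=1.0, length=1.0):
    # Squared-exponential covariance between two scalar inputs
    return signal_var * math.exp(-0.5 * ((x1 - x2) / length) ** 2)


def cholesky(a):
    # Lower-triangular Cholesky factor of a symmetric positive definite matrix
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    return low


def gp_log_marginal(xs, ys, noise_var=0.1):
    # log N(y | 0, K + noise_var * I) for a zero-mean GP; cost is cubic in len(xs)
    n = len(xs)
    cov = [[sq_exp_kernel(xs[i], xs[j]) + (noise_var if i == j else 0.0)
            for j in range(n)] for i in range(n)]
    low = cholesky(cov)
    # Forward substitution: solve low @ z = ys, so that ys^T cov^{-1} ys = z^T z
    z = []
    for i in range(n):
        z.append((ys[i] - sum(low[i][j] * z[j] for j in range(i))) / low[i][i])
    log_det = 2.0 * sum(math.log(low[i][i]) for i in range(n))
    return (-0.5 * sum(v * v for v in z) - 0.5 * log_det
            - 0.5 * n * math.log(2.0 * math.pi))


def moe_log_lik(xs, ys, labels):
    # Sum of independent per-expert log marginals for a given hard partition
    total = 0.0
    for k in sorted(set(labels)):
        idx = [i for i, c in enumerate(labels) if c == k]
        total += gp_log_marginal([xs[i] for i in idx], [ys[i] for i in idx])
    return total


xs = [0.0, 0.5, 1.0, 5.0, 5.5, 6.0]
ys = [0.1, 0.3, 0.2, 2.0, 2.1, 1.9]
full_gp = moe_log_lik(xs, ys, [0, 0, 0, 0, 0, 0])      # one expert: a single 6x6 matrix
two_experts = moe_log_lik(xs, ys, [0, 0, 0, 1, 1, 1])  # two experts: two 3x3 matrices
```

With $N$ points split evenly among $K$ experts, the factorization cost drops from order $N^3$ to order $K (N/K)^3 = N^3 / K^2$, and the per-expert terms are independent, so they can be evaluated in parallel.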
