Fast Deep Mixtures of Gaussian Process Experts

Clement Etienam, Kody Law, Sara Wade, Vitaly Zankin

TL;DR

The work addresses flexible conditional density estimation under non-stationarity and heteroscedasticity by combining GP experts with a deep neural gating network within a mixture-of-experts framework. It introduces a fast one-pass Cluster-Classify-Regress (CCR) approximation and a rigorous MM-based MAP inference approach, linking the two as a practical pathway to accurate, uncertainty-aware predictions on large, high-dimensional datasets. Empirical results across diverse benchmarks show improved accuracy and well-calibrated uncertainty with notably reduced computation, especially for big data regimes like the χ_150k dataset. The method emphasizes scalability, robust uncertainty quantification, and potential applicability to broader MoE architectures and infinite mixtures in future work.

Abstract

Mixtures of experts have become an indispensable tool for flexible modelling in a supervised learning context, allowing not only the mean function but the entire density of the output to change with the inputs. Sparse Gaussian processes (GPs) have shown promise as a leading candidate for the experts in such models, and in this article, we propose to design the gating network for selecting the experts from such mixtures of sparse GPs using a deep neural network (DNN). Furthermore, a fast one-pass algorithm called Cluster-Classify-Regress (CCR) is leveraged to approximate the maximum a posteriori (MAP) estimator extremely quickly. This combination of model and algorithm delivers a novel method which is flexible, robust, and extremely efficient. In particular, the method is able to outperform competing methods in terms of accuracy and uncertainty quantification. The cost is competitive on low-dimensional and small data sets, but is significantly lower for higher-dimensional and big data sets. Iteratively maximizing the distribution of experts given allocations and allocations given experts does not provide significant improvement, which indicates that the algorithm achieves a good approximation to the local MAP estimator very quickly. This insight may also be useful in the context of other mixture-of-experts models.
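To make the one-pass Cluster-Classify-Regress idea concrete, the following is a minimal sketch under stated assumptions, not the authors' implementation. It uses NumPy only. Several pieces are stand-ins chosen for brevity: k-means on the standardized joint (x, y) points for the clustering step, a small one-hidden-layer tanh network in place of the deep gating network, and an exact GP with fixed RBF hyperparameters in place of a sparse GP expert. All function names (`kmeans`, `fit_gate`, `fit_gp`, `ccr_fit`, `ccr_predict`) and the toy data are hypothetical.

```python
import numpy as np

def softmax(A):
    E = np.exp(A - A.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

def kmeans(Z, k, iters=50, seed=0):
    """Cluster: hard-assign joint (x, y) points to k groups (assumed choice)."""
    rng = np.random.default_rng(seed)
    centers = Z[rng.choice(len(Z), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((Z[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = Z[labels == j].mean(0)
    return labels

def fit_gate(X, labels, k, hidden=16, lr=0.1, epochs=2000, seed=0):
    """Classify: tiny tanh network (stand-in for a DNN) mapping x to cluster probabilities."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(0, 1, (X.shape[1], hidden)); b1 = np.zeros(hidden)
    W2 = rng.normal(0, 0.1, (hidden, k)); b2 = np.zeros(k)
    Y = np.eye(k)[labels]
    for _ in range(epochs):
        H = np.tanh(X @ W1 + b1)
        G = (softmax(H @ W2 + b2) - Y) / len(X)   # d(mean cross-entropy)/d(logits)
        GH = (G @ W2.T) * (1 - H ** 2)            # backpropagate through tanh
        W2 -= lr * H.T @ G; b2 -= lr * G.sum(0)
        W1 -= lr * X.T @ GH; b1 -= lr * GH.sum(0)
    return lambda Xn: softmax(np.tanh(Xn @ W1 + b1) @ W2 + b2)

def rbf(A, B, ell):
    return np.exp(-0.5 * ((A[:, None] - B[None]) ** 2).sum(-1) / ell ** 2)

def fit_gp(X, y, ell=0.3, noise=0.1):
    """Regress: exact GP with fixed hyperparameters (the paper uses sparse GPs)."""
    mu = y.mean()
    L = np.linalg.cholesky(rbf(X, X, ell) + noise ** 2 * np.eye(len(X)))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y - mu))
    def predict(Xn):
        Ks = rbf(Xn, X, ell)
        v = np.linalg.solve(L, Ks.T)
        return mu + Ks @ alpha, 1.0 + noise ** 2 - (v ** 2).sum(0)
    return predict

def ccr_fit(X, y, k=2):
    Z = np.column_stack([X, y])
    labels = kmeans((Z - Z.mean(0)) / Z.std(0), k)            # 1. cluster
    gate = fit_gate(X, labels, k)                             # 2. classify
    experts = [fit_gp(X[labels == j], y[labels == j]) for j in range(k)]  # 3. regress
    return gate, experts

def ccr_predict(gate, experts, Xn):
    P = gate(Xn)
    preds = [e(Xn) for e in experts]
    M = np.column_stack([m for m, _ in preds])
    V = np.column_stack([v for _, v in preds])
    soft = (P * M).sum(1)                          # mixture mean (soft allocation)
    hard = M[np.arange(len(Xn)), P.argmax(1)]      # most probable expert only
    var = (P * (V + M ** 2)).sum(1) - soft ** 2    # mixture variance
    return soft, hard, var

# Toy data with two regimes (hypothetical): a sinusoid for x < 0, a shifted line otherwise.
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, (200, 1))
y = np.where(X[:, 0] < 0, np.sin(4 * X[:, 0]), 3 + 0.5 * X[:, 0]) + 0.1 * rng.normal(size=200)
gate, experts = ccr_fit(X, y, k=2)
soft, hard, var = ccr_predict(gate, experts, X)
```

Each step runs exactly once, which is what "one-pass" refers to here; an iterative scheme would alternate between refitting the experts given the allocations and updating the allocations given the experts. The `soft` and `hard` outputs correspond to predictions based on soft and hard allocations, respectively.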

Paper Structure

This paper contains 16 sections, 35 equations, 6 figures, 6 tables.

Figures (6)

  • Figure 1: Motivating toy example. Comparison of the true and estimated allocations (first row) and predictions (second row) for the proposed DNN gating network with GP experts (second column), logistic regression (LR) gating network with GP experts (third column), and DNN gating network and experts (fourth column). The DNN gating network recovers the true allocations, and combined with GP experts leads to improved accuracy and uncertainty (details in \ref{sec:simulations}), especially for outlying test points.
  • Figure 2: Input distribution marginals of a $\chi$ dataset
  • Figure 3: (a) Predictions (based on soft and hard allocations) with our model for the Motorcycle dataset, with two standard deviations and soft allocation based density estimates; (b) a slice representing the density estimate given $x^* = -0.478$.
  • Figure 4: Left: Accuracy vs Time (on log scale; note purple/circled data). CCR delivers comparable/higher accuracy, with comparable/smaller cost. Right: Empirical coverage vs Average length of 95% CIs. CCR provides judicious UQ.
  • Figure 5: Heat map of the conditional density for Motorcycle data.
  • ...and 1 more figure