Gaussian Process-Gated Hierarchical Mixtures of Experts

Yuhao Liu; Marzieh Ajirak; Petar Djuric

TL;DR

GPHMEs demonstrate excellent performance on large-scale data sets, even with quite modest sizes, and they provide interpretability for deep GPs and, more generally, for deep Bayesian neural networks.

Abstract

In this paper, we propose novel Gaussian process-gated hierarchical mixtures of experts (GPHMEs). Unlike other mixtures of experts with gating models linear in the input, our model employs gating functions built with Gaussian processes (GPs). These processes are based on random features that are non-linear functions of the inputs. Furthermore, the experts in our model are also constructed with GPs. The optimization of the GPHMEs is performed by variational inference. The proposed GPHMEs have several advantages. They outperform tree-based HME benchmarks that partition the data in the input space, and they achieve good performance with reduced complexity. Another advantage is the interpretability they provide for deep GPs, and more generally, for deep Bayesian neural networks. Our GPHMEs demonstrate excellent performance for large-scale data sets, even with quite modest sizes.
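As a rough illustration of a gate that is non-linear in the input, the sketch below builds one inner-node gate from trigonometric random features. It is a minimal sketch under stated assumptions, not the authors' implementation: the RBF-style spectral sampling, the sigmoid link, and the names `random_fourier_features`, `gate_probability`, `Omega`, and `w` are all hypothetical choices made for this example.

```python
import numpy as np

def random_fourier_features(X, Omega):
    """Map inputs X of shape (n, d) to random features of shape (n, 2m).

    Omega has shape (d, m); its columns are assumed to be samples from the
    spectral density of a stationary kernel (here an RBF kernel).
    """
    proj = X @ Omega                      # (n, m) random projections
    m = Omega.shape[1]
    # Scaling by 1/sqrt(m) makes phi @ phi.T a Monte Carlo kernel estimate.
    return np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(m)

def gate_probability(X, Omega, w):
    """Probability of taking the left branch at one inner node.

    The gate is linear in the random features, hence non-linear in X.
    The sigmoid link is an assumption of this sketch.
    """
    phi = random_fourier_features(X, Omega)
    return 1.0 / (1.0 + np.exp(-(phi @ w)))

rng = np.random.default_rng(0)
n, d, m = 5, 3, 50                        # samples, input dim, random features
lengthscale = 1.0                         # assumed kernel lengthscale
X = rng.normal(size=(n, d))
Omega = rng.normal(scale=1.0 / lengthscale, size=(d, m))
w = rng.normal(size=2 * m)                # one draw of the gate weights

phi = random_fourier_features(X, Omega)
p_left = gate_probability(X, Omega, w)
```

In a variational treatment, `w` (and possibly `Omega`) would carry an approximate posterior rather than a single draw; the single sample here only shows the shape of the computation.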

Paper Structure (11 sections, 27 equations, 4 figures, 8 tables)

Introduction
Background
Random Feature Expansions for Gaussian Processes
Gaussian Process-Gated Hierarchical Mixtures of Experts
Variational Inference
Numerical Experiments
Interpretability and Explainability with GPHMEs
Discussion on Hyperparameters
UCI Data Sets
Large-Scale Data Sets
Summary

Figures (4)

Figure 1: A GPHME with a fixed tree structure, comprising expert leaves (depicted as shaded circles) and inner nodes (depicted as circles). The edges represent RF-based decision rules associated with the inner nodes, and the $Q$s denote the conditional distributions over the target variable $y$.
Figure 2: A visualization of a GPHME of depth four trained on the MNIST data set. Each leaf shows its most likely classification together with the average probability over samples. A leaf does not predict a single class; it predicts all the classes. For example, if there are 100 samples and a certain leaf predicts 80 of them as digit 0 and the other 20 as any digit from 1 to 9, then the leaf in the figure is annotated as digit 0 with p=0.8. Because the predictions occur at the leaves, the classes annotated at each inner node are traced backward from the leaves to the root, layer by layer. Our model is a "soft" decision tree in which the paths have probabilities, so a digit can "go" both left and right.
Figure 3: Convergence of RMSEs in the regression case, error rates in the classification case, and mean negative log-likelihoods (MNLLs) over time.
Figure 4: Evolution of RMSEs in the regression case, error rates in the classification case, and mean negative log-likelihoods (MNLLs) over time. The x-axes of the MNLL panels are different from those of the RMSE and ER panels because the MNLLs converge later than the RMSEs and ERs.
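The "soft" tree described in the captions of Figures 1 and 2 can be sketched as follows: every leaf receives the product of the gate probabilities along its path, and the prediction is a mixture of the leaf distributions weighted by those path probabilities. This is only an illustrative sketch. The breadth-first node ordering, the function name `leaf_probabilities`, and the fixed leaf class distributions are assumptions made here, not details taken from the paper.

```python
import numpy as np

def leaf_probabilities(gates, depth):
    """Path probability of every leaf in a full binary soft tree.

    gates: array of shape (n, 2**depth - 1) holding, for each sample, the
           probability of going left at each inner node, with inner nodes
           listed in breadth-first order (an assumption of this sketch).
    Returns an array of shape (n, 2**depth) whose rows sum to one.
    """
    n = gates.shape[0]
    probs = np.ones((n, 1))               # probability of reaching the root
    node = 0
    for level in range(depth):
        width = 2 ** level                # number of inner nodes on this level
        g = gates[:, node:node + width]
        # Each node passes probs * g to its left child and
        # probs * (1 - g) to its right child.
        probs = np.stack([probs * g, probs * (1.0 - g)], axis=2)
        probs = probs.reshape(n, 2 * width)
        node += width
    return probs

# Depth-two example: the root always goes left, its left child always goes right.
gates = np.array([[1.0, 0.0, 0.5]])
paths = leaf_probabilities(gates, depth=2)

# Mixture prediction with hypothetical per-leaf class distributions (4 leaves, 3 classes).
leaf_class_probs = np.array([
    [0.8, 0.1, 0.1],
    [0.1, 0.8, 0.1],
    [0.1, 0.1, 0.8],
    [1 / 3, 1 / 3, 1 / 3],
])
predictive = paths @ leaf_class_probs     # shape (1, 3)
```

With gate probabilities strictly between 0 and 1, every leaf receives non-zero weight, which is what lets a sample "go" both left and right.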
