Aggregation Models with Optimal Weights for Distributed Gaussian Processes

Haoyuan Chen; Rui Tuo

TL;DR

This work addresses the scalability of Gaussian process regression in distributed settings by introducing an optimal-weights aggregation framework based on OptiCom. It extends OptiCom to both sparse variational GPs and exact GPs, deriving the weights by solving a linear system that captures inter-expert correlations, which yields consistent mean predictions with manageable computational overhead. Empirical results on synthetic data and UCI temporal extrapolation show improved stability and competitive accuracy compared to PoE, BCM, grBCM, and NPAE, with significantly reduced runtime when the number of experts remains moderate. The work provides practical guidance for deploying distributed GPs at scale and outlines future directions toward variance consistency and adaptive resource allocation.

Abstract

Gaussian process (GP) models have received increasing attention in recent years due to their superb prediction accuracy and modeling flexibility. To address the computational burden of GP models on large-scale datasets, distributed learning for GPs is often adopted. Current aggregation models for distributed GPs are not time-efficient when incorporating correlations between GP experts. In this work, we propose a novel approach for aggregated prediction in distributed GPs. The technique is suitable for both exact and sparse variational GPs. The proposed method incorporates correlations among experts, leading to better prediction accuracy with manageable computational requirements. As demonstrated by empirical studies, the proposed approach yields more stable predictions in less time than state-of-the-art consistent aggregation models.
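To make the idea of correlation-aware weights concrete, here is a minimal NumPy sketch of aggregating exact-GP expert means with weights obtained from a small linear system. It is a generic best-linear-combination construction, assumed here for illustration only; it is not the OptiCom-derived weighting proposed in the paper, whose equations are not reproduced on this page. The function names (`rbf`, `optimal_weight_aggregate`), the RBF kernel, the fixed hyperparameters, and the jitter value are all hypothetical choices.

```python
import numpy as np

def rbf(a, b, lengthscale=0.5):
    """Squared-exponential kernel between 1-D input arrays a and b (unit variance)."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

def optimal_weight_aggregate(x_star, parts, noise=1e-2, lengthscale=0.5):
    """Combine exact-GP expert means at a scalar test point x_star.

    parts is a list of (X_i, y_i) local datasets, one per expert.
    Weights w solve C w = c, where under the shared GP prior
    C[i, j] = Cov(mu_i, mu_j) and c[i] = Cov(mu_i, f(x_star)).
    Illustrative assumption, not the paper's exact method.
    """
    xs = np.array([x_star])
    M = len(parts)
    alphas, mus = [], []
    for X, y in parts:
        K = rbf(X, X, lengthscale) + noise * np.eye(len(X))
        # alpha_i = (K_i + noise I)^{-1} k_i(x_star); expert mean is alpha_i^T y_i
        a = np.linalg.solve(K, rbf(X, xs, lengthscale)[:, 0])
        alphas.append(a)
        mus.append(a @ y)
    C = np.empty((M, M))
    c = np.empty(M)
    for i, (Xi, _) in enumerate(parts):
        c[i] = alphas[i] @ rbf(Xi, xs, lengthscale)[:, 0]
        for j, (Xj, _) in enumerate(parts):
            Kij = rbf(Xi, Xj, lengthscale)
            if i == j:
                # observation noise only enters the same-expert block
                Kij = Kij + noise * np.eye(len(Xi))
            C[i, j] = alphas[i] @ Kij @ alphas[j]
    # small jitter (assumed value) keeps the M x M solve stable
    w = np.linalg.solve(C + 1e-10 * np.eye(M), c)
    return float(w @ np.array(mus)), w

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = np.linspace(-1, 1, 40)
    y = np.sin(3 * X) + 0.05 * rng.standard_normal(40)
    pred, w = optimal_weight_aggregate(0.1, [(X[:20], y[:20]), (X[20:], y[20:])])
    print(pred, w)
```

The linear system here is only M by M per test point, but building the cross-covariances between experts touches every pair of local datasets. That pairwise cost is the kind of overhead that correlation-aware aggregation has to keep manageable.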

Paper Structure (49 sections, 35 equations, 8 figures, 2 tables, 6 algorithms)

Introduction
Background
GPs
GP regression
GP training
Sparse variational GP (SVGP)
Distributed GP
Distributed GP training
Aggregated prediction
GP regression with OptiCom
Combination technique (CT)
Optimized combination technique (OptiCom)
GP with OptiCom
Posterior distribution
Optimal coefficients
...and 34 more sections

Figures (8)

Figure 1: Errors and time for the posterior mean of a zero-mean Matérn 3/2 GP with OptiCom and CT of dimension $d=2$ and level $\eta=1,2,\ldots,7$, tested on the Griewank function (Griewank, 1981) over $n=2000$ training points and $n_t=1000$ test points. We denote GP with OptiCom by circles and GP with CT by diamonds. Left: Logarithm of $l_2$ errors between GP posteriors and ground truth values averaged over 10 different lengthscales $[0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.4, 1.6, 1.8, 2.0]$. Right: Logarithm of time taken to compute the posterior mean averaged over $1000$ replications.
Figure 2: Hyperparameter estimates versus the training iterations on the $n=10^4$ dataset.
Figure 3: SVGP posteriors with the different aggregation models on $n=100$ training points in $[-1,1]$ and $n_t=50$ test points in $[-1.2,1.2]$ with the number of experts $M=4$, dimension $d=1$, and the number of inducing variables $m=128$. We denote the aggregated predictions (aggreg) by dashed lines, the full SVGP posteriors (full) by solid lines, and the observations by dots.
Figure 4: Comparison of the aggregation models for exact GP with the RBF kernel in dimension $d=2$ on $n=10^4$ training points in $[-1,1]^2$ and $n_t=2500$ test points in $[-1.2,1.2]^2$. The numbers of experts considered are $M=2, 4, 8, 10, 20, 40, 80, 100$. The lower $x$-axis represents the size of the local training dataset $n_i = n / M$, and the upper $x$-axis represents the number of experts $M$. Left: Logarithm of time for computing the aggregated predictions. Middle Left: RMSE between the aggregated predictions and the ground truth. Middle Right: NLPD between the aggregated predictions and the ground truth. Right: 2-Wasserstein distance to the full GP. Top: Mean of the computational time, RMSE, NLPD, and 2-Wasserstein distance. Bottom: Standard deviation of the computational time, RMSE, NLPD, and 2-Wasserstein distance.
Figure 5: Comparison of the aggregation models for SVGP with the RBF kernel in dimension $d=2$ on $n=10^4$ training points in $[-1,1]^2$, $m=128$ inducing points, and $n_t=2500$ test points in $[-1.2,1.2]^2$. The numbers of experts considered are $M=2, 4, 8, 10, 20, 40, 80, 100$. The lower $x$-axis represents the size of the local training dataset $n_i = n / M$, and the upper $x$-axis represents the number of experts $M$. Left: Logarithm of time for computing the aggregated predictions. Middle Left: RMSE between the aggregated predictions and the ground truth. Middle Right: NLPD between the aggregated predictions and the ground truth. Right: 2-Wasserstein distance to the full GP. Top: Mean of the computational time, RMSE, NLPD, and 2-Wasserstein distance. Bottom: Standard deviation of the computational time, RMSE, NLPD, and 2-Wasserstein distance.
...and 3 more figures
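
The captions for Figures 4 and 5 report RMSE, NLPD, and 2-Wasserstein distance. As a rough guide to what these quantities measure, the sketch below computes them for Gaussian predictive marginals. The function names are hypothetical, and averaging the per-point Wasserstein distances over the test set is an assumption made here; the paper may aggregate them differently.

```python
import numpy as np

def rmse(mean, y_true):
    """Root mean squared error between predictive means and targets."""
    return float(np.sqrt(np.mean((mean - y_true) ** 2)))

def nlpd(mean, var, y_true):
    """Average negative log predictive density under N(mean, var) marginals."""
    return float(np.mean(0.5 * np.log(2 * np.pi * var)
                         + 0.5 * (y_true - mean) ** 2 / var))

def w2_gaussian(mean_a, var_a, mean_b, var_b):
    """2-Wasserstein distance between univariate Gaussians, per test point.

    For N(m_a, s_a^2) and N(m_b, s_b^2) the closed form is
    sqrt((m_a - m_b)^2 + (s_a - s_b)^2); the mean over test points
    is an assumed way of summarizing the per-point distances.
    """
    d2 = (mean_a - mean_b) ** 2 + (np.sqrt(var_a) - np.sqrt(var_b)) ** 2
    return float(np.mean(np.sqrt(d2)))
```

RMSE looks only at the predictive mean, whereas NLPD and the Wasserstein distance also penalize poorly calibrated variances, which is why the captions report them side by side.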
