Massively Parallel Expectation Maximization For Approximate Posteriors

Thomas Heap, Sam Bowyer, Laurence Aitchison

TL;DR

This work tackles scalable Bayesian inference for large hierarchical models by introducing QEM, a gradient-free EM-like procedure that learns an approximate posterior from massively parallel posterior moments. The E-step uses massively parallel importance weighting (MPIW) to estimate true posterior moments, and the M-step updates simple exponential-family posteriors to match those moments, stabilized by an exponential moving average. QEM is shown to outperform gradient-based massively parallel VI and RWS in ELBO, predictive log-likelihood, and moment accuracy, while also being invariant to reparameterizations. The approach enables fast, robust inference on diverse datasets and suggests a path toward gradient-free, scalable probabilistic programming.

Abstract

Bayesian inference for hierarchical models can be very challenging. MCMC methods have difficulty scaling to large models with many observations and latent variables. While variational inference (VI) and reweighted wake-sleep (RWS) can be more scalable, they are gradient-based methods and so often require many iterations to converge. Our key insight was that modern massively parallel importance weighting methods (Bowyer et al., 2024) give fast and accurate posterior moment estimates, and we can use these moment estimates to rapidly learn an approximate posterior. Specifically, we propose using expectation maximization to fit the approximate posterior, which we call QEM. The expectation step involves computing the posterior moments using high-quality massively parallel estimates from Bowyer et al. (2024). The maximization step involves fitting the approximate posterior using these moments, which can be done straightforwardly for simple approximate posteriors such as Gaussian, Gamma, Beta, Dirichlet, Binomial, Multinomial, Categorical, etc. (or combinations thereof). We show that QEM is faster than state-of-the-art, massively parallel variants of RWS and VI, and is invariant to reparameterizations of the model that dramatically slow down gradient-based methods.
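The E-step/M-step loop described in the abstract can be illustrated on a toy problem. The following is a minimal sketch, not the authors' implementation: it assumes a one-latent conjugate model (z ~ N(0, 1), x_i | z ~ N(z, 1)), a Gaussian approximate posterior, and plain self-normalized importance weighting in place of the massively parallel estimator of Bowyer et al. (2024). It also omits the exponential moving average. All function names (`e_step`, `m_step`, `fit`) and the data are hypothetical.

```python
import math
import random


def log_normal_pdf(x, mean, var):
    """Log density of N(mean, var) evaluated at x."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def e_step(data, mu, var, num_samples, rng):
    """Estimate the posterior moments E[z] and E[z^2] by importance weighting.

    Samples come from the current approximate posterior Q = N(mu, var).
    This is ordinary self-normalized importance sampling (an assumption made
    for brevity), not the massively parallel scheme used in the paper.
    """
    samples = [rng.gauss(mu, math.sqrt(var)) for _ in range(num_samples)]
    log_w = []
    for z in samples:
        # Toy model: prior z ~ N(0, 1), likelihood x_i | z ~ N(z, 1).
        log_joint = log_normal_pdf(z, 0.0, 1.0)
        log_joint += sum(log_normal_pdf(x, z, 1.0) for x in data)
        log_w.append(log_joint - log_normal_pdf(z, mu, var))
    # Subtract the maximum log-weight before exponentiating for stability.
    max_lw = max(log_w)
    w = [math.exp(lw - max_lw) for lw in log_w]
    total = sum(w)
    m1 = sum(wi * z for wi, z in zip(w, samples)) / total
    m2 = sum(wi * z * z for wi, z in zip(w, samples)) / total
    return m1, m2


def m_step(m1, m2):
    """Moment-match a Gaussian: mean = E[z], variance = E[z^2] - E[z]^2."""
    return m1, max(m2 - m1 ** 2, 1e-8)


def fit(data, num_iters=10, num_samples=2000, seed=0):
    """Alternate E- and M-steps, starting from Q equal to the prior."""
    rng = random.Random(seed)
    mu, var = 0.0, 1.0
    for _ in range(num_iters):
        m1, m2 = e_step(data, mu, var, num_samples, rng)
        mu, var = m_step(m1, m2)
    return mu, var


data = [1.2, 0.8, 1.5, 0.9, 1.1]  # made-up observations
mu, var = fit(data)
```

Because this toy model is conjugate, the exact posterior is N(sum(x) / (n + 1), 1 / (n + 1)), so the fitted `mu` and `var` can be compared against it directly. No gradients or learning rates appear anywhere in the loop; each update only requires moment estimates.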

Paper Structure

This paper contains 36 sections, 2 theorems, 69 equations, 10 figures, 1 table, 1 algorithm.

Key Result

theorem 1

Consider an exponential moving average moment estimator of the form given in Eq. (eq:ema), where $m^\text{one iter}_t$ is an unbiased estimator with finite variance and where $0 < p < 1$. In the limit as $t \rightarrow \infty$, $m_t$ is unbiased and has zero variance.
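The behaviour the theorem describes can be checked numerically. The sketch below is an illustration under stated assumptions, not the paper's Eq. (eq:ema), which is not reproduced here: it assumes the update $m_t = (1 - \eta_t)\, m_{t-1} + \eta_t\, m^\text{one iter}_t$ with a hypothetical decaying rate $\eta_t = t^{-p}$, and it uses Gaussian noise around a known value as the unbiased one-iteration estimator.

```python
import random
import statistics


def ema_trajectory(true_moment, noise_sd, num_iters, p, rng):
    """Run one EMA chain and return the estimate m_t after every iteration."""
    m = 0.0
    history = []
    for t in range(1, num_iters + 1):
        # Stand-in for an unbiased, finite-variance one-iteration estimate.
        one_iter = rng.gauss(true_moment, noise_sd)
        # Assumed schedule: eta_t = t^(-p) with 0 < p < 1 (hypothetical).
        eta = t ** (-p)
        m = (1.0 - eta) * m + eta * one_iter
        history.append(m)
    return history


rng = random.Random(0)
# 200 independent chains, each tracking a true moment of 3.0.
runs = [ema_trajectory(3.0, 1.0, 2000, 0.7, rng) for _ in range(200)]

early = [run[19] for run in runs]  # estimates after 20 iterations
late = [run[-1] for run in runs]   # estimates after 2000 iterations

var_early = statistics.pvariance(early)
var_late = statistics.pvariance(late)
mean_late = statistics.mean(late)
```

Under these assumptions, the spread of the estimates across chains (`var_late` versus `var_early`) shrinks as the iteration count grows, while their average stays close to the true moment. A fixed rate would instead leave a variance floor, which is one reason a decaying schedule is a natural choice here.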

Figures (10)

  • Figure 1: Comparing the ELBO (top row) and predictive-log-likelihood (bottom row) of QEM (pink), RWS (green) and VI (orange) on several models, with iteration number on the x-axis. We report error bars on each line of one standard error over five repeated runs with the same data but using different random seeds. Note we did not run VI on the occupancy model as it has discrete latent variables.
  • Figure 2: As in Figure 1, but with time, rather than iterations, on the x-axis. Again, note we cannot run VI on the occupancy model as it has discrete latent variables.
  • Figure 3: Comparing the time-per-iteration between QEM (pink), RWS (green) and VI (orange) on several models with varying values of K. Black error bars represent one standard deviation over all iterations and experiment repeats.
  • Figure 4: Mean squared error between first moment estimates for each method and HMC first moment estimates plotted against time. Occupancy is not plotted because it has discrete latent variables, preventing the use of HMC, and Covid is not plotted because we were not able to scale HMC to this larger model.
  • Figure 5: Comparing the ELBOs achieved by each method on reparameterized models (coloured lines) versus the original parameterization (black lines), with error bars representing standard errors over five runs with the same data but using different random seeds. In many cases, the reparameterization led to a different learning rate being optimal for MP VI and MP RWS. Where this occurred, we have plotted the learning rate that was optimal for the original parameterization in a solid colour and the learning rate that was optimal for the reparameterization in a fainter colour.
  • ...and 5 more figures

Theorems & Definitions (3)

  • theorem 1
  • theorem 2
  • proof