Fearless Stochasticity in Expectation Propagation

Jonathan So; Richard E. Turner

TL;DR

A novel perspective is provided on the moment-matching updates of EP, namely, that they perform natural-gradient-based optimisation of a variational objective, which is particularly well-suited to MC estimation.

Abstract

Expectation propagation (EP) is a family of algorithms for performing approximate inference in probabilistic models. The updates of EP involve the evaluation of moments -- expectations of certain functions -- which can be estimated from Monte Carlo (MC) samples. However, the updates are not robust to MC noise when performed naively, and various prior works have attempted to address this issue in different ways. In this work, we provide a novel perspective on the moment-matching updates of EP; namely, that they perform natural-gradient-based optimisation of a variational objective. We use this insight to motivate two new EP variants, with updates that are particularly well-suited to MC estimation. They remain stable and are most sample-efficient when estimated with just a single sample. These new variants combine the benefits of their predecessors and address key weaknesses. In particular, they are easier to tune, offer an improved speed-accuracy trade-off, and do not rely on the use of debiasing estimators. We demonstrate their efficacy on a variety of probabilistic inference tasks.
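The sensitivity to MC noise mentioned above can be illustrated with a minimal sketch. It is not the paper's experiment: it assumes a one-dimensional Gaussian, estimates its variance by maximum likelihood from `n_samp` samples, and converts that to a precision (a natural parameter). Because the conversion is nonlinear, the precision estimate is biased even though the underlying moment estimates are consistent. The function name `naive_precision_estimates` and all settings are hypothetical.

```python
import numpy as np


def naive_precision_estimates(n_samp, n_trials, true_var=1.0, seed=0):
    """Return n_trials naive precision estimates, each from n_samp Gaussian samples.

    Illustrative assumption: zero-mean Gaussian target with variance true_var.
    """
    rng = np.random.default_rng(seed)
    z = rng.normal(0.0, np.sqrt(true_var), size=(n_trials, n_samp))
    var_hat = z.var(axis=1)  # maximum likelihood variance (divides by n_samp)
    return 1.0 / var_hat     # nonlinear map from moments to a natural parameter


if __name__ == "__main__":
    true_precision = 1.0
    for n_samp in (10, 100, 1000):
        est = naive_precision_estimates(n_samp, n_trials=20000)
        # The average error shrinks with n_samp but stays positive on average.
        print(n_samp, est.mean() - true_precision)
```

For this Gaussian case the expected naive estimate is $n / ((n-3)\sigma^2)$ for $n > 3$ samples, so the bias is large for small sample sizes and vanishes only as $n$ grows.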

Paper Structure (57 sections, 6 theorems, 62 equations, 9 figures, 2 algorithms)

Introduction
Background
Expectation propagation (EP)
Variational problem
EP updates
Unified EP algorithm
Stochastic moment estimation
Fearlessly stochastic EP algorithms
Natural gradient view of EP
EP-eta
EP-mu
Related work
Evaluation
Hierarchical logistic regression with MVN prior
Hierarchical logistic regression with NIW prior
...and 42 more sections

Key Result

Proposition 1

For $\alpha > 0$, the moment-matching update of EP (eqn:ep_inner_update) is equivalent to performing an NGD step in $L$ with respect to the mean parameters of $\tilde{p}_i^{(t)}$ with step size $\alpha^{-1}$. That is, for $\mu_i = \mathbb{E}_{\tilde{p}_i^{(t)}(z)}[s(z)]$, and ...
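A standard exponential-family identity helps when reading a statement of this kind: the natural gradient with respect to the mean parameters $\mu$ equals the ordinary gradient with respect to the natural parameters $\eta$. The sketch below checks this numerically for a one-dimensional Gaussian with sufficient statistics $s(z) = (z, z^2)$. It does not use the paper's objective $L$; the function `f_eta` is an arbitrary stand-in, and all names and values are hypothetical.

```python
import numpy as np


def eta_from_mu(mu):
    """Map Gaussian mean parameters (E[z], E[z^2]) to natural parameters."""
    m = mu[0]
    v = mu[1] - m ** 2
    return np.array([m / v, -0.5 / v])


def f_eta(eta):
    """Arbitrary smooth test function of the natural parameters (assumption)."""
    return np.sin(eta[0]) + eta[0] * eta[1] + eta[1] ** 2


def f_mu(mu):
    """The same function expressed in mean parameters."""
    return f_eta(eta_from_mu(mu))


def num_grad(f, x, h=1e-6):
    """Central finite-difference gradient."""
    g = np.zeros_like(x)
    for k in range(x.size):
        e = np.zeros_like(x)
        e[k] = h
        g[k] = (f(x + e) - f(x - e)) / (2 * h)
    return g


m, v = 0.7, 1.3                      # illustrative mean and variance
mu = np.array([m, m ** 2 + v])
eta = eta_from_mu(mu)

# Covariance of s(z) = (z, z^2) under N(m, v). The Fisher information in
# mean parameters is its inverse, so F(mu)^{-1} = Cov[s(z)].
cov_s = np.array([[v, 2 * m * v],
                  [2 * m * v, 4 * m ** 2 * v + 2 * v ** 2]])

nat_grad_mu = cov_s @ num_grad(f_mu, mu)   # natural gradient w.r.t. mu
grad_eta = num_grad(f_eta, eta)            # ordinary gradient w.r.t. eta

print(nat_grad_mu, grad_eta)
```

The two printed vectors should agree up to finite-difference error. Under this identity, a natural gradient step in mean parameters needs only ordinary derivatives with respect to natural parameters, with no explicit Fisher matrix inversion.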

Figures (9)

Figure 1: The effect of step size ($\alpha$ or $\epsilon$) and number of MC samples ($n_\text{samp}$) on different EP variants in a stochastic version of the clutter problem of Minka (2001). EP (naive) uses maximum likelihood estimation for the updates, and EP (debiased) uses the estimator of Xu et al. (2014). Step size corresponds to $\alpha$ for EP, and $\epsilon$ for EP-$\mu$ and EP-$\eta$. Only EP-$\eta$ and EP-$\mu$ can perform $1$-sample updates, hence the other traces are not visible. The left panel shows the expected decrease in $L$ after $100/n_\text{samp}$ steps. Performing e.g. $100$ $1$-sample steps, or $10$ $10$-sample steps, achieves a much larger decrease in $L$ than a single $100$-sample step. The right panel shows the magnitude of the bias in $\lambda_i$ after a single parallel update, averaged over all sites and dimensions. The bias of EP-$\mu$ shrinks far faster as the step size decreases than that of EP. EP-$\eta$ is always unbiased and so is not visible.
Figure 2: Pareto frontiers showing the number of NUTS steps ($x$-axis) against the KL divergence from $p$ to an estimate of the optimum ($y$-axis). Each point on the plot marks the lowest average KL divergence attained by any hyperparameter setting by that step count. Error bars mark the full range of values for the marked hyperparameter setting across 5 random seeds.
Figure 3: Comparison of EP-$\eta$ with conjugate-computation variational inference (CVI) on a hierarchical logistic regression model. The two leftmost plots show forward (solid) and reverse (dashed) KL divergences between the approximation of each method and an MVN distribution estimated directly from MCMC samples. The left panel shows a comparison with respect to wall-clock time, when NUTS is used as the underlying sampling kernel for EP-$\eta$. The left-middle panel shows a similar comparison, but with respect to the number of samples drawn, and using an "oracle" sampling kernel for EP-$\eta$. Hyperparameters were tuned for each method and shaded regions show the range of trajectories across 5 random seeds. The right and right-middle panels show pairwise marginals of the various MVN approximations overlaid on contours of the true posterior. Coloured dots and ellipses correspond to means and 2-standard-deviation contours, respectively. See Section \ref{sec:limitations} for discussion of these results, and Appendix \ref{app:cvicomparison} for further details.
Figure 4: Directed graphical model for the experiments of Section \ref{sec:evaluation}.
Figure 5: The effect of varying EP hyperparameters. Partial Pareto frontiers show the number of NUTS steps ($x$-axis) against the KL divergence from $p$ to an estimate of the optimum ($y$-axis).
...and 4 more figures

Theorems & Definitions (9)

Proposition 1
Proposition 2
Proposition 3
Proposition 3
proof
Proposition 3
proof
Proposition 3
proof
