MCMC-driven learning

Alexandre Bouchard-Côté; Trevor Campbell; Geoff Pleiss; Nikola Surjanovic

TL;DR

This work introduces Markovian optimization-integration (MOI), a unifying framework for problems where both the target distribution and the sampling kernel depend on parameters that are learned from the Markov chain, formalized as solving $g(\phi)=\mathbb{E}_{\pi_\phi}[g(X,\phi)]=0$ or minimizing $f(\phi)=\mathbb{E}_{\pi_\phi}[f(X,\phi)]$. It shows that a wide range of MCMC/ML tasks (forward and reverse KL variational inference, adaptive MCMC, transport-assisted MCMC, surrogate-based inference, coreset MCMC, and Markov chain gradient descent) fit the MOI formulation, so that results developed for one method can be translated to the others within a common theoretical framework. The chapter surveys gradient estimation strategies (reparameterization and REINFORCE), automatic differentiation, mini-batching, and stabilization techniques, and develops a convergence theory under deterministic, independent-noise, and Markovian-noise assumptions, including confinement and variance-reduction considerations. It closes with case studies on distribution approximation via forward KL minimization, tempering, and approximate transport maps, illustrating scalable MOI for learning expressive proposals and accelerating MCMC in big-data regimes.
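As a rough illustration of the root-finding form $\mathbb{E}_{\pi_\phi}[g(X,\phi)]=0$, the following minimal sketch interleaves one Markov kernel step with one stochastic-approximation update of $\phi=(m,v)$, where $g(x,\phi)=(m-x,\; v-(x-m)^2)$. Everything here is an assumption made for illustration rather than the chapter's own algorithm: the toy target $N(0.2, 0.4)$ (borrowed from the Figure 5 caption), the random-walk Metropolis kernel, the $2.38\sqrt{v}$ proposal scale (a common heuristic), the step-size schedule, the variance floor, and the function name `adaptive_rwm`. In this special case the target does not depend on $\phi$; only the kernel does.

```python
import math
import random


def log_target(x, mean=0.2, var=0.4):
    """Unnormalized log-density of the assumed toy target N(mean, var)."""
    return -0.5 * (x - mean) ** 2 / var


def adaptive_rwm(n_iters=50_000, seed=0):
    """Alternate one kernel step with one parameter update (hypothetical sketch)."""
    rng = random.Random(seed)
    x = 0.0          # current state of the Markov chain
    m, v = 0.0, 1.0  # learned parameters phi = (mean, variance)
    for t in range(1, n_iters + 1):
        # Kernel step: random-walk Metropolis whose proposal scale depends on phi.
        scale = 2.38 * math.sqrt(v)
        proposal = x + scale * rng.gauss(0.0, 1.0)
        log_ratio = log_target(proposal) - log_target(x)
        if log_ratio >= 0.0 or rng.random() < math.exp(log_ratio):
            x = proposal

        # Parameter step: phi <- phi - gamma_t * g(x, phi), with decaying gamma_t.
        gamma = 1.0 / (t + 10) ** 0.7
        delta = x - m
        m += gamma * delta
        v += gamma * (delta * delta - v)
        v = max(v, 1e-6)  # crude floor keeping the proposal scale positive
    return m, v


if __name__ == "__main__":
    m, v = adaptive_rwm()
    print(f"learned mean = {m:.3f}, learned variance = {v:.3f}")
```

The decaying step size is what lets the Markovian noise average out; the variance floor stands in for the confinement and stabilization techniques mentioned above.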

Abstract

This paper is intended to appear as a chapter for the Handbook of Markov Chain Monte Carlo. The goal of this chapter is to unify various problems at the intersection of Markov chain Monte Carlo (MCMC) and machine learning (including black-box variational inference, adaptive MCMC, normalizing flow construction and transport-assisted MCMC, surrogate-likelihood MCMC, coreset construction for MCMC with big data, Markov chain gradient descent, Markovian score climbing, and more) within one common framework. By doing so, the theory and methods developed for each may be translated and generalized.

Paper Structure (45 sections, 97 equations, 7 figures, 1 table, 2 algorithms)

Introduction
Examples of MOI problems
Forward KL variational inference
Reverse KL variational inference
Adaptive MCMC
Transport-assisted MCMC
Surrogate-based inference
Coreset MCMC
Markov chain gradient descent
Strategies for MOI problems
Stochastic gradient estimation
Reparameterization trick
REINFORCE
Automatic differentiation
Mini-batching
...and 30 more sections

Figures (7)

Figure 1: Examples of failures of the adaptive algorithm \ref{eq:rm} in a deterministic setting. Orange arrow segments depict the sequence of iterates. \ref{fig:wiggle}: $f$ is given by \ref{eq:wiggle}; the adaptive algorithm is led away from a solution by a decaying right tail. \ref{fig:stuck_quadratic}: $f$ is given by \ref{eq:quadratic}; the step size sequence decays too quickly, and the adaptive algorithm gets stuck. \ref{fig:diverge_quadratic}: $f$ is given by \ref{eq:quadratic}; the step size sequence is too aggressive, and the adaptive algorithm becomes unstable. \ref{fig:cosh}: $f$ is given by \ref{eq:cosh}; the step size sequence is reasonable, but the function $f$ is not Lipschitz smooth, leading again to instability.
Figure 2: An example of a failure of the adaptive algorithm \ref{eq:rm} due to noise. Orange arrow segments depict the sequence of iterates. The objective $f$ is given by \ref{eq:quadratic_again}. Because the step size sequence is a constant, the perturbations due to noise do not decay adequately and the sequence does not converge.
Figure 3: An example of a failure of the adaptive algorithm \ref{eq:rm} due to non-uniform kernel mixing behaviour. Orange arrow segments depict the sequence of iterates. The objective $f$ is given by \ref{eq:strcvx}. The algorithm is initialized in the $+$ state (green dashed line), and so the iterates follow the negative gradient and proceed in the positive direction. However, the kernel mixes increasingly more slowly for parameters of larger magnitude; eventually the kernel becomes stuck in the $+$ state and the iterates diverge to $+\infty$.
Figure 4: Visualization of the bridge SDE posterior distribution used to illustrate the various algorithms in this section. The SDE is a Wright--Fisher diffusion on $(0, 1)$. Left: Distribution over SDE paths (red) bridging two fixed anchors and avoiding a hard constraint (thick black line). Middle: Pairwise distribution of variables in the neighborhood of the hard constraint showing multimodality. Right: Pair plot of all pairwise posterior distributions.
Figure 5: Online learning of a normal distribution with mean $0.2$ and variance $0.4$ from draws.
...and 2 more figures
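
The noise-driven failure described in the Figure 2 caption can be reproduced in a few lines. This is a minimal sketch under stated assumptions, not the figure's actual experiment: the objective $f(\phi)=\phi^2/2$ is a stand-in (the equation the caption refers to is not reproduced here), the gradient noise is independent standard normal, and the function name `noisy_gradient_descent` and both step-size schedules are hypothetical choices.

```python
import random


def noisy_gradient_descent(step_size, n_iters=20_000, seed=1):
    """Run phi <- phi - step_size(t) * (f'(phi) + noise) for f(phi) = phi^2 / 2.

    Returns the average of phi^2 over the last 1000 iterates, a rough
    measure of how far the iterates remain from the minimizer phi = 0.
    """
    rng = random.Random(seed)
    phi = 1.0
    tail = []
    for t in range(1, n_iters + 1):
        noisy_grad = phi + rng.gauss(0.0, 1.0)  # exact gradient plus N(0, 1) noise
        phi -= step_size(t) * noisy_grad
        if t > n_iters - 1000:
            tail.append(phi * phi)
    return sum(tail) / len(tail)


# Constant step size: the noise never averages out, so the iterates keep fluctuating.
constant_error = noisy_gradient_descent(lambda t: 0.5)
# Decaying step size 1/t: the perturbations shrink and the iterates settle near 0.
decaying_error = noisy_gradient_descent(lambda t: 1.0 / t)

if __name__ == "__main__":
    print(f"constant step: mean squared error in tail = {constant_error:.4f}")
    print(f"decaying step: mean squared error in tail = {decaying_error:.6f}")
```

With the constant step the tail error stays bounded away from zero, whereas the decaying schedule drives it toward zero, which matches the qualitative behaviour the caption describes.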
