Reinforcement Learning for Adaptive MCMC

Congye Wang, Wilson Chen, Heishiro Kanagawa, Chris. J. Oates

TL;DR

This work reframes adaptive MCMC as a reinforcement learning task by introducing Reinforcement Learning Metropolis-Hastings (RLMH), which learns a state-dependent Metropolis-Hastings (MH) proposal through a neural-network-parameterized map φ trained by policy-gradient optimization. The authors establish that φ-MH is p-invariant and ergodic, and prove that the adaptive chain converges to the target under diminishing adaptation and gradient clipping. They implement a gradient-free sampler trained with the deep deterministic policy gradient (DDPG) algorithm and report that it outperforms a popular gradient-free adaptive Metropolis-Hastings algorithm on $\approx 90 \%$ of tasks in the PosteriorDB benchmark. The study sets out a general, theoretically supported pathway for applying RL to adaptive MCMC and suggests extending the approach to gradient-based proposals and related Monte Carlo methods.

Abstract

An informal observation, made by several authors, is that the adaptive design of a Markov transition kernel has the flavour of a reinforcement learning task. Yet, to date, it has remained unclear how to actually exploit modern reinforcement learning technologies for adaptive MCMC. The aim of this paper is to set out a general framework, called Reinforcement Learning Metropolis-Hastings, that is theoretically supported and empirically validated. Our principal focus is on learning fast-mixing Metropolis-Hastings transition kernels, which we cast as deterministic policies and optimise via a policy gradient. Control of the learning rate provably ensures conditions for ergodicity are satisfied. The methodology is used to construct a gradient-free sampler that outperforms a popular gradient-free adaptive Metropolis-Hastings algorithm on $\approx 90 \%$ of tasks in the PosteriorDB benchmark.
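
The role of learning-rate control can be illustrated with a minimal sketch. It assumes a hypothetical polynomially decaying schedule and norm-based gradient clipping; the names `clip_by_norm`, `learning_rate`, and `adapt`, and the constants `a0`, `decay`, and `max_norm`, are illustrative choices rather than the authors' settings. The point is only that clipped gradients combined with a decaying learning rate bound how far the policy parameters can move, so adaptation diminishes over time.

```python
import numpy as np

def clip_by_norm(g, max_norm):
    """Rescale g so that its Euclidean norm is at most max_norm."""
    norm = np.linalg.norm(g)
    if norm <= max_norm:
        return g
    return g * (max_norm / norm)

def learning_rate(n, a0=0.1, decay=1.5):
    """Hypothetical schedule a0 / (n + 1)**decay (summable when decay > 1)."""
    return a0 / (n + 1) ** decay

def adapt(theta, grads, max_norm=1.0):
    """Apply clipped gradient-ascent updates with a decaying learning rate."""
    theta = np.asarray(theta, dtype=float)
    for n, g in enumerate(grads):
        g = clip_by_norm(np.asarray(g, dtype=float), max_norm)
        theta = theta + learning_rate(n) * g  # step length <= learning_rate(n) * max_norm
    return theta
```

Under these assumptions the total parameter displacement is at most `max_norm` times the sum of the learning rates, however large the raw gradients are.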

Paper Structure

This paper contains 45 sections, 13 theorems, 72 equations, 6 figures, 1 table, 5 algorithms.

Key Result

Lemma 1

Let $\phi$ be continuous, and let both $x \mapsto p(x)$ and $(\varphi,x,y) \mapsto q_\varphi(x,y)$ be positive and continuous. Then $\phi$-MH is $p$-invariant and ergodic.
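
A minimal sketch of one φ-MH transition may help make the lemma concrete. It assumes a Gaussian proposal centred at φ(x) with fixed scale, the equally-weighted mixture of $\mathcal{N}(\pm 5,1)$ as target, and a hand-picked, hypothetical map `phi_flip(x) = -x` standing in for a trained neural network; all function names are illustrative. Because the proposal depends on the current state through φ, the acceptance ratio evaluates the forward density with φ(x) and the reverse density with φ(y).

```python
import numpy as np

def log_p(x):
    """Log-density (up to a constant) of the equal-weight mixture of N(-5,1) and N(5,1)."""
    return np.logaddexp(-0.5 * (x - 5.0) ** 2, -0.5 * (x + 5.0) ** 2)

def log_q(mean, y, scale=1.0):
    """Log-density (up to a constant) of N(mean, scale^2) evaluated at y."""
    return -0.5 * ((y - mean) / scale) ** 2

def phi_mh_step(x, phi, rng, scale=1.0):
    """One phi-MH transition with proposal N(phi(x), scale^2)."""
    y = phi(x) + scale * rng.standard_normal()
    # Forward move uses phi(x); reverse move uses phi(y).
    log_ratio = (log_p(y) + log_q(phi(y), x, scale)) - (log_p(x) + log_q(phi(x), y, scale))
    alpha = np.exp(min(0.0, log_ratio))  # acceptance probability
    accepted = rng.random() < alpha
    return (y if accepted else x), alpha

def phi_flip(x):
    """Hypothetical hand-picked map: propose around the mirror image of x."""
    return -x

rng = np.random.default_rng(0)
x, samples = 5.0, []
for _ in range(20000):
    x, _ = phi_mh_step(x, phi_flip, rng)
    samples.append(x)
samples = np.array(samples)
```

With this choice of φ the chain proposes jumps between the two modes, and the collected `samples` can be compared against the target, whose mean is 0 and second moment is 26.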

Figures (6)

  • Figure 1: RLMH, illustrated. Here the task is to sample from the Gaussian mixture model $p(\cdot)$ whose equally-weighted components are $\mathcal{N}(\pm 5,1)$. Left: The reward sequence $(r_n)_{n \geq 0}$, where $r_n$ is the logarithm of the expected squared jump distance corresponding to iteration $n$ of RLMH. Middle: Proposal mean functions $x \mapsto \phi(x)$, at initialisation in 0, and corresponding to the rewards indicated in 1 and 2. Right: The density $p(\cdot)$, and histograms of the last $n = 5,000$ samples produced using MALA, NUTS, and RLMH. [A smoothing window of length 5 was applied to the reward sequence to improve clarity of this plot.]
  • Figure 2: Investigating sensitivity to the architecture of the neural network $\phi$ in RLMH. For the experiment presented in Figure 1 of the main text we employed a two-layer (i.e. $h = 1$ hidden layer) neural network with width $w = 32$. The same experiment was performed with the architecture dimensions $(h,w)$ changed to (a) (1,16), (b) (1,64), (c) (1,256), (d) (2,32), and (e) (3,32); in all cases similar conclusions were obtained. [The colour convention and the interpretation of each panel are identical to those of Figure 1 in the main text.]
  • Figure 3: Investigating sensitivity to the architecture of the neural network $\phi$ in RLMH, continued.
  • Figure 4: RLMH, illustrated. Here we considered (a) a skewed target, (b) a skewed multimodal target, and (c) an unequally-weighted mixture model target. Left: The reward sequence $(r_n)_{n \geq 0}$, where $r_n$ is the logarithm of the expected squared jump distance corresponding to iteration $n$ of RLMH. Middle: Proposal mean functions $x \mapsto \phi(x)$, at initialisation (top), and corresponding to the rewards indicated in red (middle) and blue (bottom). Right: The density $p(\cdot)$, and histograms of the last $n = 5,000$ samples produced using MALA, NUTS, and RLMH. [A smoothing window of length 5 was applied to the reward sequence to improve clarity of this plot.]
  • Figure 5: RLMH, illustrated, continued. Here we considered (a) a skewed target, (b) a skewed multimodal target, and (c) an unequally-weighted mixture model target. Left: The reward sequence $(r_n)_{n \geq 0}$, where $r_n$ is the logarithm of the expected squared jump distance corresponding to iteration $n$ of RLMH. Middle: Proposal mean functions $x \mapsto \phi(x)$, at initialisation (top), and corresponding to the rewards indicated in red (middle) and blue (bottom). Right: The density $p(\cdot)$, and histograms of the last $n = 5,000$ samples produced using MALA, NUTS, and RLMH. [A smoothing window of length 5 was applied to the reward sequence to improve clarity of this plot.]
  • ...and 1 more figure
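
The reward described in the captions above, the logarithm of the expected squared jump distance, can be sketched as follows. This assumes the per-iteration estimate is the acceptance probability times the squared distance between the current state and the proposal, which is one common estimator and not necessarily the exact form used in the paper. The helper names and the small constant `eps` (a guard against taking the logarithm of zero) are illustrative. The `smooth` helper mirrors the length-5 smoothing window mentioned in the captions, implemented here as a plain moving average.

```python
import numpy as np

def log_esjd_reward(x, y, alpha, eps=1e-12):
    """Log of alpha * ||y - x||^2, an estimate of the expected squared jump distance."""
    diff = np.asarray(y, dtype=float) - np.asarray(x, dtype=float)
    sq_jump = float(np.sum(diff ** 2))
    return float(np.log(alpha * sq_jump + eps))  # eps avoids log(0); an assumption

def smooth(rewards, window=5):
    """Moving average over a fixed window, for plotting a noisy reward sequence."""
    kernel = np.ones(window) / window
    return np.convolve(np.asarray(rewards, dtype=float), kernel, mode="valid")
```

For example, a proposal at distance 2 from the current state with acceptance probability 0.5 gives a reward of approximately log 2.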

Theorems & Definitions (32)

  • Example 1: $\phi$-MH based on symmetric random walk proposal
  • Example 2: $\phi$-MH based on independence sampler proposal
  • Lemma 1: Ergodicity of $\phi$-MH; $\mathcal{X} = \mathbb{R}^d$
  • Remark 1: Choice of action
  • Remark 2: Choice of reward
  • Remark 3: On-policy requirement
  • Definition 1: SSAGE; Roberts and Rosenthal (2007)
  • Theorem 1: SSAGE for general Metropolis-Hastings
  • Theorem 2: Ergodicity of RLMH
  • Proof: Proof of Lemma 1
  • ...and 22 more