Policy Gradients for Optimal Parallel Tempering MCMC

Daniel Zhao, Natesh S. Pillai

TL;DR

This work tackles adaptive temperature ladder design for parallel tempering MCMC by formulating temperature selection as a stateless policy-gradient problem aimed at maximizing long-run sampler efficiency. It introduces the swap mean-distance as a reward component and provides convergence guarantees via diminishing adaptation, showing that the temperature schedule can be updated on-the-fly without compromising ergodicity. Empirical results across multimodal and rugged distributions demonstrate that a policy gradient with swap mean-distance achieves substantially lower integrated autocorrelation time $\tau_h$ than geometrically spaced ladders and uniform-acceptance baselines. The findings suggest that reward shaping beyond uniform acceptance can meaningfully improve mixing, offering a practical, theoretically grounded approach to adaptive parallel tempering with potential broad impact for Bayesian computation in challenging distributions.
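The integrated autocorrelation time used as the efficiency measure above can be illustrated with a small estimator. This is a minimal sketch, not the paper's estimator: it assumes the standard definition $\tau_h = 1 + 2\sum_{k \ge 1} \rho_k$ for a scalar series $h(X_t)$ and truncates the sum at the first non-positive sample autocorrelation, a common heuristic. The function names and the AR(1) test chain are hypothetical and only serve the illustration.

```python
import random


def autocorrelation(xs, lag):
    """Sample autocorrelation of xs at the given lag (biased estimator)."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    if var == 0.0:
        return 0.0  # a constant series carries no correlation information
    cov = sum((xs[i] - mean) * (xs[i + lag] - mean) for i in range(n - lag)) / n
    return cov / var


def integrated_autocorrelation_time(xs, max_lag=None):
    """Estimate tau = 1 + 2 * sum_{k>=1} rho_k.

    Assumption: the sum is truncated at the first non-positive rho_k,
    which is a simple heuristic rather than the paper's procedure.
    """
    n = len(xs)
    if max_lag is None:
        max_lag = n // 2
    tau = 1.0
    for lag in range(1, max_lag):
        rho = autocorrelation(xs, lag)
        if rho <= 0.0:
            break
        tau += 2.0 * rho
    return tau


def ar1_chain(phi, n, seed=0):
    """Hypothetical test series: x_t = phi * x_{t-1} + standard normal noise."""
    rng = random.Random(seed)
    x = 0.0
    out = []
    for _ in range(n):
        x = phi * x + rng.gauss(0.0, 1.0)
        out.append(x)
    return out


if __name__ == "__main__":
    # Theory for AR(1): tau = (1 + phi) / (1 - phi), so 19 for phi = 0.9 and 1 for phi = 0.
    print(integrated_autocorrelation_time(ar1_chain(0.9, 5000)))
    print(integrated_autocorrelation_time(ar1_chain(0.0, 5000)))
```

A lower value means fewer iterations are needed per effectively independent sample, which is why the comparison between temperature ladders is phrased in terms of $\tau_h$.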

Abstract

Parallel tempering is a meta-algorithm for Markov Chain Monte Carlo that uses multiple chains to sample from tempered versions of the target distribution, enhancing mixing in multi-modal distributions that are challenging for traditional methods. The effectiveness of parallel tempering is heavily influenced by the selection of chain temperatures. Here, we present an adaptive temperature selection algorithm that dynamically adjusts temperatures during sampling using a policy gradient approach. Experiments demonstrate that our method can achieve lower integrated autocorrelation times compared to traditional geometrically spaced temperatures and uniform acceptance rate schemes on benchmark distributions.
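The baseline that the abstract compares against, parallel tempering with a fixed geometric ladder, can be sketched as follows. This is an illustrative sketch under stated assumptions, not the adaptive policy-gradient algorithm of the paper: the bimodal target, the random-walk proposal scaled by $1/\sqrt{\beta}$, the choice of one random adjacent swap per iteration, and all function names are assumptions made for the example. Chain $i$ targets $\pi(x)^{\beta_i}$, and a swap between adjacent chains $i$ and $j$ is accepted with probability $\min\{1, \exp[(\beta_i - \beta_j)(\log \pi(x_j) - \log \pi(x_i))]\}$.

```python
import math
import random


def log_target(x):
    """Unnormalised log density of an illustrative bimodal mixture (modes at -4 and +4)."""
    a = -0.5 * (x - 4.0) ** 2
    b = -0.5 * (x + 4.0) ** 2
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))  # stable log-sum-exp


def geometric_ladder(num_chains, beta_min):
    """Inverse temperatures from 1 down to beta_min with a constant ratio."""
    ratio = beta_min ** (1.0 / (num_chains - 1))
    return [ratio ** k for k in range(num_chains)]


def parallel_tempering(betas, num_iters, step=1.0, seed=0):
    """Run parallel tempering with a fixed ladder; return cold-chain samples and swap rates."""
    rng = random.Random(seed)
    k = len(betas)
    xs = [-4.0] * k  # every chain starts in the left mode
    accepted = [0] * (k - 1)
    proposed = [0] * (k - 1)
    cold_samples = []
    for _ in range(num_iters):
        # Local move: random-walk Metropolis targeting pi(x)^beta in each chain.
        for i, beta in enumerate(betas):
            prop = xs[i] + rng.gauss(0.0, step / math.sqrt(beta))
            log_alpha = beta * (log_target(prop) - log_target(xs[i]))
            if rng.random() < math.exp(min(0.0, log_alpha)):
                xs[i] = prop
        # Swap move: propose exchanging the states of one random adjacent pair.
        i = rng.randrange(k - 1)
        proposed[i] += 1
        log_alpha = (betas[i] - betas[i + 1]) * (log_target(xs[i + 1]) - log_target(xs[i]))
        if rng.random() < math.exp(min(0.0, log_alpha)):
            xs[i], xs[i + 1] = xs[i + 1], xs[i]
            accepted[i] += 1
        cold_samples.append(xs[0])  # chain 0 has beta = 1, i.e. the target itself
    rates = [a / p if p else 0.0 for a, p in zip(accepted, proposed)]
    return cold_samples, rates


if __name__ == "__main__":
    betas = geometric_ladder(5, 0.05)
    samples, rates = parallel_tempering(betas, 20000)
    print("inverse temperatures:", betas)
    print("swap acceptance rates:", rates)
```

The per-pair swap acceptance rates returned here are the quantity that uniform-acceptance schemes try to equalise. An adaptive method would update `betas` between iterations instead of holding them fixed.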

Paper Structure

This paper contains 20 sections, 1 theorem, 17 equations, 5 figures, 1 table, 2 algorithms.

Key Result

Theorem 4.1

Let $\{X_t\}_{t \in \mathbb{N}}$ be a Markov chain and $P_{\Gamma_{t}}(x, \cdot)$ the $t^{\text{th}}$ adapted conditional density for $X_{t+1}$ given $X_t = x$. Then the Strong Law of Large Numbers holds for $\{X_t\}$ if two conditions are met:

Figures (5)

  • Figure 1: Autocorrelation time (ACT) over time for geometrically spaced temperatures, the uniform acceptance rate scheme (Vousden et al., 2015), and the policy gradient algorithm with different reward functions. The target is the egg-box distribution. Each step represents $400$ iterations of the policy gradient update.
  • Figure 2: Evolution of $\log \beta$ and acceptance rates over $4000$ update steps. The target is the egg-box distribution. Data is thinned by a factor of $100$.
  • Figure 3: Averaged negative log-likelihoods of the samples at each time step of the algorithm. The shaded region indicates a $95\%$ confidence interval over $10$ trials.
  • Figure 4: Scatter plot of swap mean-distance against ACT.
  • Figure 5: Box plot of final temperatures after $4000$ iterations with the swap mean-distance reward function.

Theorems & Definitions (1)

  • Theorem 4.1