
Reinforced sequential Monte Carlo for amortised sampling

Sanghyeok Choi, Sarthak Mittal, Víctor Elvira, Jinkyoo Park, Nikolay Malkin

TL;DR

This work tackles sampling from high-dimensional unnormalised densities by integrating amortised hierarchical samplers with particle methods through a unified framework that connects hierarchical variational inference, MaxEnt RL, and sequential Monte Carlo (SMC) with annealed importance sampling (AIS). It introduces off-policy entropic RL with trajectory balance (TB) and subtrajectory balance (SubTB) losses. Learnable flows shape the proposals and twisted targets, so the sampler can be trained effectively using samples from SMC as a behaviour policy. An importance-weighted replay buffer and adaptive weight tempering reduce gradient variance and promote stable, diverse exploration of multimodal targets. The approach is demonstrated for both diffusion samplers and discrete samplers built from prepend/append actions. Empirical results on synthetic multimodal benchmarks and the alanine dipeptide Boltzmann distribution show improved distribution approximation, mode coverage, and training stability over both amortised and Monte Carlo baselines, highlighting a general, scalable framework for combining RL, variational inference, and Monte Carlo methods in sampling tasks.
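
The TB and SubTB objectives named above can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation. A trajectory $x_0 \to \dots \to x_N$ is summarised by per-step forward log-probabilities $\log P_F$, backward log-probabilities $\log P_B$, and log-flows $\log F_t$ at each state. It assumes a single fixed initial state, so $\log F_0$ plays the role of $\log Z$, and $\log F_N = \log R(x_N)$. The losses only evaluate these quantities on a given trajectory, so that trajectory may come from any behaviour policy (for example SMC), which is what makes training off-policy. The $\lambda^{j-i}$ weighting in `subtb_loss` follows the common SubTB($\lambda$) formulation and is an assumption here. All function names are hypothetical.

```python
import numpy as np


def _potentials(log_flows, log_pf, log_pb):
    """u_t = log F_t(x_t) - sum_{s<t} [log P_F(x_{s+1}|x_s) - log P_B(x_s|x_{s+1})].

    With this shift, the balance residual of the sub-trajectory x_i -> x_j is u_i - u_j.
    """
    diff = np.asarray(log_pf, dtype=float) - np.asarray(log_pb, dtype=float)
    acc = np.concatenate([[0.0], np.cumsum(diff)])
    return np.asarray(log_flows, dtype=float) - acc


def tb_loss(log_flows, log_pf, log_pb):
    """Trajectory balance: squared residual over the full trajectory x_0 -> x_N.

    Assumes log_flows[0] = log Z (fixed initial state) and log_flows[-1] = log R(x_N).
    """
    u = _potentials(log_flows, log_pf, log_pb)
    return (u[0] - u[-1]) ** 2


def subtb_loss(log_flows, log_pf, log_pb, lam=0.9):
    """Subtrajectory balance: lam**(j - i)-weighted mean of squared residuals over all i < j."""
    u = _potentials(log_flows, log_pf, log_pb)
    i, j = np.triu_indices(len(u), k=1)  # every pair of states with i < j
    weights = lam ** (j - i)
    return np.sum(weights * (u[i] - u[j]) ** 2) / np.sum(weights)


# Hypothetical numbers for a 4-step trajectory (e.g. one drawn by SMC).
log_pf = np.array([-1.2, -0.7, -0.9, -1.1])
log_pb = np.array([-1.0, -0.8, -1.0, -0.9])
log_flows = np.array([0.5, 0.1, -0.2, 0.4, 1.3])  # log Z, log F_1..log F_3, log R(x_4)
print(tb_loss(log_flows, log_pf, log_pb), subtb_loss(log_flows, log_pf, log_pb))
```

Writing both losses through the shifted potentials $u_t$ means every sub-trajectory residual is a single difference, so SubTB over all $O(N^2)$ pairs needs no nested sums.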

Abstract

This paper proposes a synergy of amortised and particle-based methods for sampling from distributions defined by unnormalised density functions. We state a connection between sequential Monte Carlo (SMC) and neural sequential samplers trained by maximum-entropy reinforcement learning (MaxEnt RL), wherein learnt sampling policies and value functions define proposal kernels and twist functions. Exploiting this connection, we introduce an off-policy RL training procedure for the sampler that uses samples from SMC (with the learnt sampler as its proposal) as a behaviour policy that better explores the target distribution. We describe techniques for stable joint training of proposals and twist functions and an adaptive weight tempering scheme to reduce training signal variance. Furthermore, building upon past attempts to use experience replay to guide the training of neural samplers, we derive a way to combine historical samples with annealed importance sampling weights within a replay buffer. On synthetic multi-modal targets (in both continuous and discrete spaces) and the Boltzmann distribution of alanine dipeptide conformations, we demonstrate improvements in approximating the true distribution as well as training stability compared to both amortised and Monte Carlo methods.
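
The SMC connection described in the abstract can be read as follows. Assume the learnt forward policy $P_F$ is the proposal kernel, the learnt log-flows $\log F_t$ are the twist functions, and the backward policy $P_B$ serves as the backward kernel. Under those assumptions, each particle's incremental log-weight is $\log F_{t+1}(x_{t+1}) + \log P_B(x_t\mid x_{t+1}) - \log F_t(x_t) - \log P_F(x_{t+1}\mid x_t)$. The sketch below uses two standard SMC devices chosen for illustration: ESS-triggered multinomial resampling, and a bisection search for an exponent $\beta$ that tempers the importance weights. These are hypothetical stand-ins, not the paper's exact resampling rule or adaptive tempering scheme. All function names are hypothetical.

```python
import numpy as np


def log_normal(x, mean, var):
    """Elementwise log-density of N(mean, var)."""
    return -0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var)


def ess(log_w):
    """Effective sample size of self-normalised importance weights."""
    w = np.exp(log_w - np.max(log_w))
    w /= w.sum()
    return 1.0 / np.sum(w ** 2)


def smc_step(x, log_w, t, sample_pf, log_pf, log_pb, log_flow, rng, ess_frac=0.5):
    """One hypothetical SMC step: resample if needed, propose with P_F, reweight.

    sample_pf(x, t, rng) draws x_{t+1} ~ P_F(. | x_t); log_pf(x_next, x, t) and
    log_pb(x, x_next, t) are log kernel densities; log_flow(x, t) is the log
    twist log F_t(x), with log F_N = log R at the final step.
    """
    m = len(x)
    if ess(log_w) < ess_frac * m:
        # Multinomial resampling, then reset to uniform weights.
        w = np.exp(log_w - np.max(log_w))
        idx = rng.choice(m, size=m, p=w / w.sum())
        x, log_w = x[idx], np.zeros(m)
    x_next = sample_pf(x, t, rng)
    # Incremental weight F_{t+1}(x_{t+1}) P_B(x_t|x_{t+1}) / (F_t(x_t) P_F(x_{t+1}|x_t)).
    log_w = log_w + (log_flow(x_next, t + 1) + log_pb(x, x_next, t)
                     - log_flow(x, t) - log_pf(x_next, x, t))
    return x_next, log_w


def tempered_weights(log_w, target_frac=0.5, iters=50):
    """Largest beta in [0, 1] whose tempered weights w**beta keep ESS >= target_frac * M.

    Found by bisection; ESS of w**beta is non-increasing in beta, and beta = 0 gives ESS = M.
    """
    m = len(log_w)
    if ess(log_w) >= target_frac * m:
        beta = 1.0
    else:
        lo, hi = 0.0, 1.0
        for _ in range(iters):
            mid = 0.5 * (lo + hi)
            if ess(mid * log_w) >= target_frac * m:
                lo = mid
            else:
                hi = mid
        beta = lo
    w = np.exp(beta * (log_w - np.max(log_w)))
    return beta, w / w.sum()


# Demo: a 1-D Gaussian random walk whose flows and kernels are exactly consistent,
# F_t = N(0, t+1), P_F(x'|x) = N(x, 1), P_B(x|x') = N(x'(t+1)/(t+2), (t+1)/(t+2)).
rng = np.random.default_rng(0)
x, log_w = rng.normal(size=256), np.zeros(256)
for t in range(5):
    x, log_w = smc_step(
        x, log_w, t,
        sample_pf=lambda x, t, rng: x + rng.normal(size=x.shape),
        log_pf=lambda xn, x, t: log_normal(xn, x, 1.0),
        log_pb=lambda x, xn, t: log_normal(x, xn * (t + 1) / (t + 2), (t + 1) / (t + 2)),
        log_flow=lambda x, t: log_normal(x, 0.0, t + 1.0),
        rng=rng,
    )
print(np.abs(log_w).max())  # ~0 up to floating-point error
```

In the demo, the kernels and flows satisfy $F_t(x)P_F(x'\mid x) = F_{t+1}(x')P_B(x\mid x')$ exactly, so every incremental log-weight is zero. With learnt, imperfect components the weights drift apart. Tempering them with $\beta < 1$ then trades some bias for a lower-variance training signal.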

Paper Structure

This paper contains 74 sections, 25 equations, 9 figures, 9 tables and 7 algorithms.

Figures (9)

  • Figure 1: Visualisation of generated samples for the GMM40 ($d=2$) target in the gradient-free setting.
  • Figure 2: Visualisation of generated samples for the ManyWell ($d=32$) target in the gradient-free setting.
  • Figure 3: Visualisation of generated samples for the Robot4 ($d=10$) target in the gradient-based setting. We adopt the visualisation method of chen2025sequential.
  • Figure 4: Visualisation of generated samples for the ManyWell ($d=64$) target in the gradient-based setting.
  • Figure 5: Visualisation of intermediate marginals for the GMM40 ($d=2$) target. "Ground Truth" denotes the intermediate marginals defined by the target $\pi\propto R$ and $\overleftarrow{p}$ (eq:pdest_diffusion). These are still Gaussian mixture models and can be obtained analytically. Note that $\overrightarrow{p}_0$ is the 2-dimensional Gaussian $\mathcal{N}(0,\sigma^2 I)$ and $N=64$, i.e., $F_{64}=R$.
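
The Figure 5 caption states that the intermediate marginals stay Gaussian mixtures. That follows from a standard fact: pushing a Gaussian mixture through a linear-Gaussian kernel $x \mapsto a x + \sqrt{b}\,\varepsilon$ maps each component $\mathcal{N}(\mu_k, s_k^2 I)$ to $\mathcal{N}(a\mu_k, (a^2 s_k^2 + b) I)$ and leaves the mixture weights unchanged. The sketch below assumes isotropic components and a linear-Gaussian step. It does not reproduce the actual $\overleftarrow{p}$ or the noise schedule used for the figure, and the function names and example mixture are hypothetical.

```python
import numpy as np


def push_gmm(weights, means, variances, a, b):
    """Marginal of y = a*x + sqrt(b)*eps with x ~ GMM and eps ~ N(0, I).

    Components are isotropic: means has shape (K, d), variances shape (K,).
    Each N(mu_k, s_k^2 I) becomes N(a*mu_k, (a^2 s_k^2 + b) I); weights are unchanged.
    """
    return (np.asarray(weights, dtype=float),
            a * np.asarray(means, dtype=float),
            a ** 2 * np.asarray(variances, dtype=float) + b)


def gmm_logpdf(x, weights, means, variances):
    """Log-density of an isotropic GMM at points x of shape (n, d)."""
    d = x.shape[1]
    sq = np.sum((x[:, None, :] - means[None, :, :]) ** 2, axis=-1)  # (n, K)
    log_terms = np.log(weights) - 0.5 * (sq / variances + d * np.log(2 * np.pi * variances))
    top = log_terms.max(axis=1, keepdims=True)  # log-sum-exp for numerical stability
    return top[:, 0] + np.log(np.exp(log_terms - top).sum(axis=1))


# Hypothetical example: a 2-component mixture in d = 2, noised by two additive steps.
w0 = np.array([0.3, 0.7])
mu0 = np.array([[0.0, 0.0], [4.0, -2.0]])
var0 = np.array([1.0, 0.5])
w1, mu1, var1 = push_gmm(w0, mu0, var0, a=1.0, b=0.25)
w2, mu2, var2 = push_gmm(w1, mu1, var1, a=1.0, b=0.25)
print(var2)  # each additive-noise step adds 0.25 to every component variance
print(gmm_logpdf(np.zeros((1, 2)), w2, mu2, var2))
```

Because each step only rescales means and adds to variances, the ground-truth marginal at any intermediate step can be evaluated in closed form rather than estimated from samples.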