Improved off-policy training of diffusion samplers

Marcin Sendera; Minsu Kim; Sarthak Mittal; Pablo Lemos; Luca Scimeca; Jarrid Rector-Brooks; Alexandre Adam; Yoshua Bengio; Nikolay Malkin

TL;DR

The paper tackles sampling from unnormalized densities using diffusion models and continuous GFlowNets. It introduces a unified library of diffusion-structured samplers and a novel exploration strategy that combines local search with a replay buffer to improve sample quality. A thorough comparison of diffusion-based and off-policy methods shows that simple exploration improves performance, that Langevin-type inductive biases improve credit assignment, and that local search substantially mitigates mode collapse. Key contributions include empirical benchmarks across diverse densities, an analysis of credit-assignment strategies, and a practical off-policy exploration method with demonstrated gains. The work advances amortized inference with diffusion samplers and provides public code to support reproducibility and future research on efficient high-dimensional sampling and latent-variable inference.

Abstract

We study the problem of training diffusion models to sample from a distribution with a given unnormalized density or energy function. We benchmark several diffusion-structured inference methods, including simulation-based variational approaches and off-policy methods (continuous generative flow networks). Our results shed light on the relative advantages of existing algorithms while bringing into question some claims from past work. We also propose a novel exploration strategy for off-policy methods, based on local search in the target space with the use of a replay buffer, and show that it improves the quality of samples on a variety of target distributions. Our code for the sampling methods and benchmarks studied is made public at https://github.com/GFNOrg/gfn-diffusion as a base for future work on diffusion models for amortized inference.
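The exploration strategy mentioned in the abstract (local search in the target space combined with a replay buffer) can be illustrated with a minimal sketch. Everything concrete below is an assumption made for illustration and is not taken from the paper or its repository: the 1D two-mode target, the choice of a Metropolis-adjusted Langevin (MALA) step as the local-search move, the `ReplayBuffer` class, and all hyperparameters. Gaussian noise stands in for samples drawn from a partially trained sampler.

```python
import math
import random

# Hypothetical 1D target: equal mixture of two unit-variance Gaussians.
MODES = (-3.0, 3.0)


def log_density(x):
    """Unnormalized log-density: log sum_k exp(-(x - mu_k)^2 / 2)."""
    terms = [-0.5 * (x - mu) ** 2 for mu in MODES]
    top = max(terms)  # subtract the max for a stable log-sum-exp
    return top + math.log(sum(math.exp(t - top) for t in terms))


def grad_log_density(x):
    """Gradient of log_density: responsibility-weighted pull toward each mode."""
    terms = [-0.5 * (x - mu) ** 2 for mu in MODES]
    top = max(terms)
    weights = [math.exp(t - top) for t in terms]
    total = sum(weights)
    return sum(w / total * (mu - x) for w, mu in zip(weights, MODES))


def mala_step(x, step, rng):
    """One MALA step, used here as the (assumed) local-search move."""

    def log_q(dst, src):
        # Log-density, up to a constant, of proposing dst from src.
        mean = src + step * grad_log_density(src)
        return -((dst - mean) ** 2) / (4.0 * step)

    noise = rng.gauss(0.0, 1.0)
    proposal = x + step * grad_log_density(x) + math.sqrt(2.0 * step) * noise
    log_accept = (log_density(proposal) - log_density(x)
                  + log_q(x, proposal) - log_q(proposal, x))
    # Metropolis-Hastings correction: accept with probability min(1, ratio).
    if rng.random() < math.exp(min(0.0, log_accept)):
        return proposal
    return x


class ReplayBuffer:
    """Fixed-capacity buffer that drops the oldest samples first."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = []

    def add(self, xs):
        self.items.extend(xs)
        self.items = self.items[-self.capacity:]

    def sample(self, n, rng):
        return [rng.choice(self.items) for _ in range(n)]


def local_search(xs, n_steps, step, rng):
    """Refine each sample with n_steps MALA moves."""
    refined = []
    for x in xs:
        for _ in range(n_steps):
            x = mala_step(x, step, rng)
        refined.append(x)
    return refined


def demo(seed=0):
    rng = random.Random(seed)
    buffer = ReplayBuffer(capacity=256)
    # Stand-in for samples from a partially trained sampler (assumption).
    buffer.add([rng.gauss(0.0, 6.0) for _ in range(128)])
    before = sum(log_density(x) for x in buffer.items) / len(buffer.items)
    refined = local_search(buffer.sample(64, rng), n_steps=50, step=0.1, rng=rng)
    after = sum(log_density(x) for x in refined) / len(refined)
    buffer.add(refined)  # refined samples become off-policy training data
    return before, after, buffer
```

Calling `demo()` returns the mean log-density of the initial buffer, the mean log-density of the refined samples, and the updated buffer. In an off-policy training loop, the refined samples would be drawn from the buffer and used as training targets for the sampler; that training step is omitted here.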

Paper Structure (51 sections, 25 equations, 11 figures, 6 tables, 2 algorithms)

Introduction
Prior work
Setting: Diffusion-structured sampling
Euler-Maruyama hierarchical samplers
Generative modeling with SDEs.
Time discretization.
SDE learning as hierarchical variational inference.
Euler-Maruyama samplers as GFlowNets
State and action space.
Forward policy and learning problem.
Backward policy and trajectory balance.
Off-policy optimization.
Other objectives.
Exploration and credit assignment in continuous GFlowNets
Credit assignment methods
...and 36 more sections

Figures (11)

Figure 1: Two-dimensional projections of Manywell samples from models trained by different algorithms. Our proposed replay buffer with local search is capable of preventing mode collapse.
Figure 2: Effect of exploration variance on models trained with TB on the 25GMM energy. Exploration promotes mode discovery, but should be decayed over time to optimally allocate the modeling power to high-likelihood trajectories.
Figure 3: Left: Distribution of ${\mathbf{x}}_0,{\mathbf{x}}_{0.1},\dots,{\mathbf{x}}_1$ learned by 10-step samplers with fixed (top) and learned (middle) forward policy variance on the 25GMM energy. The last step of sampling the fixed-variance model adds Gaussian noise of a variance close to that of the components of the target distribution, preventing the sampler from sharply capturing the modes. The last row shows the policy variance learned as a function of ${\mathbf{x}}_t$ at various time steps $t$ (white is high variance, blue is low), showing that less noise is added around the peaks near $t=1$. The two models' log-partition function estimates are $-1.67$ and $-0.62$, respectively. Right: For a varying number of steps $T$, we plot the $\log\hat{Z}$ obtained by models with fixed and learned variance. Learning policy variances gives similar samplers with fewer steps.
Figure D.1: Conditioning data (MNIST test set)
Figure E.1: Capacity $30,000$
...and 6 more figures
