Training Neural Samplers with Reverse Diffusive KL Divergence

Jiajun He, Wenlin Chen, Mingtian Zhang, David Barber, José Miguel Hernández-Lobato

TL;DR

This work introduces reverse diffusive KL divergence (DiKL) to train neural samplers for unnormalized target distributions, addressing the mode-seeking tendency of traditional reverse KL by diffusing both model and target across multiple Gaussian kernels. The method combines denoising score matching (DSM) for model-score estimation with Mixed Score Identity (MSI) for noisy-target scores, enabling a practical, gradient-based training routine for implicit generators. Applied to Boltzmann generators, the DiKL framework leverages equivariant architectures to respect invariances, achieving competitive or superior mode coverage and sampling efficiency on multi-modal energy landscapes. The approach delivers fast, one-shot sampling with strong mass-covering properties, offering a scalable alternative to diffusion-based and flow-based samplers while highlighting areas for future enhancement, such as combining with multi-step strategies and improving posterior sampling stability.

Abstract

Training generative models to sample from unnormalized density functions is an important and challenging task in machine learning. Traditional training methods often rely on the reverse Kullback-Leibler (KL) divergence due to its tractability. However, the mode-seeking behavior of reverse KL hinders effective approximation of multi-modal target distributions. To address this, we propose to minimize the reverse KL along diffusion trajectories of both model and target densities. We refer to this objective as the reverse diffusive KL divergence, which allows the model to capture multiple modes. Leveraging this objective, we train neural samplers that can efficiently generate samples from the target distribution in one step. We demonstrate that our method enhances sampling performance across various Boltzmann distributions, including both synthetic multi-modal densities and n-body particle systems.
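As a sketch of the objective the abstract describes, the reverse KL can be summed over a set of Gaussian convolution kernels applied to both densities. The notation here is an assumption chosen for illustration: $p_\theta$ is the model, $p_d$ the target, $k_t$ the $t$-th kernel, $T$ the number of noise levels, and $w(t)$ a positive weight.

$$
\mathrm{DiKL}(p_\theta \,\|\, p_d) \;=\; \sum_{t=1}^{T} w(t)\, \mathrm{KL}\big(p_\theta * k_t \,\big\|\, p_d * k_t\big),
\qquad (p * k_t)(x_t) = \int k_t(x_t|x)\, p(x)\, \mathrm{d}x .
$$

Each term is an ordinary reverse KL between smoothed densities, so terms at larger noise levels compare versions of the model and target in which separate modes have been bridged.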

Paper Structure

This paper contains 35 sections, 5 theorems, 60 equations, 9 figures, 3 tables, 1 algorithm.

Key Result

Proposition 4.1

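For any convolution kernel $k(x_t|x)$, we have

$$\nabla_{x_t}\log p_\theta(x_t) = \int \nabla_{x_t}\log k(x_t|x)\, p_\theta(x|x_t)\,\mathrm{d}x,$$

where $p_\theta(x_t)=\int k(x_t|x)\,p_\theta(x)\,\mathrm{d}x$ is the noisy model density and $p_\theta(x|x_t)\propto k(x_t|x)p_\theta(x)$ is the model posterior.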
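The denoising score identity named in Proposition 4.1 states that the score of the noisy model density equals the posterior expectation of the kernel score. The following is a minimal numerical check of that statement, not the paper's training code. It assumes a hypothetical 1D mixture-of-Gaussians model and a Gaussian kernel $k(x_t|x)=\mathcal{N}(x_t|x,\sigma^2)$; all function names and parameter values are chosen for illustration only.

```python
import numpy as np

def normal_pdf(x, mean, std):
    """Density of N(mean, std^2) evaluated at x (broadcasts over arrays)."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

# Hypothetical 1D "model" density p_theta(x): a two-component Gaussian mixture.
WEIGHTS = np.array([0.3, 0.7])
MEANS = np.array([-2.0, 3.0])
STDS = np.array([0.5, 1.0])

def model_pdf(x):
    """Evaluate the mixture density at every point of x."""
    x = np.asarray(x, dtype=float)[..., None]
    return np.sum(WEIGHTS * normal_pdf(x, MEANS, STDS), axis=-1)

def noisy_score_closed_form(x_t, sigma):
    """Score of p_theta(x_t); a Gaussian convolved with a Gaussian adds variances."""
    stds_t = np.sqrt(STDS**2 + sigma**2)
    comp = WEIGHTS * normal_pdf(x_t, MEANS, stds_t)
    d_comp = comp * (-(x_t - MEANS) / stds_t**2)
    return d_comp.sum() / comp.sum()

def noisy_score_via_identity(x_t, sigma, grid):
    """Posterior expectation of the kernel score, computed by quadrature on a grid."""
    # Posterior p_theta(x | x_t) is proportional to k(x_t | x) * p_theta(x).
    unnorm = normal_pdf(x_t, grid, sigma) * model_pdf(grid)
    posterior = unnorm / unnorm.sum()  # uniform grid, so the spacing cancels
    # Gradient of log N(x_t | x, sigma^2) with respect to x_t.
    kernel_score = (grid - x_t) / sigma**2
    return np.sum(posterior * kernel_score)

grid = np.linspace(-15.0, 15.0, 20001)
sigma = 1.5
for x_t in (-3.0, 0.5, 4.0):
    exact = noisy_score_closed_form(x_t, sigma)
    via_identity = noisy_score_via_identity(x_t, sigma, grid)
    print(f"x_t={x_t:5.1f}  closed form={exact:.6f}  identity={via_identity:.6f}")
```

In this toy setting the posterior expectation is computed by brute-force quadrature, which is only feasible in low dimensions. The TL;DR describes estimating the model score with denoising score matching instead.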

Figures (9)

  • Figure 1: We convolve the original distribution $p(x)$ with a Gaussian kernel $\mathcal{N}(\tilde{x}|x,\sigma^2)$ for $\sigma\in\{5,10\}$. This demonstrates that Gaussian convolution can bridge modes and even reduce the number of modes as the variance of the Gaussian increases.
  • Figure 2: Heatmap of (log scale) KL divergence at different noise levels between a Gaussian model (with mean parameter $\mu$ and standard deviation parameter $\sigma$) and a two-mode MoG target in 1D. At lower noise levels (or in the extreme case, the standard reverse KL), the divergence is highly mode-seeking, with the model favoring either one of the two modes in the target distribution. However, perhaps surprisingly, the KL divergence becomes more mass-covering at a higher noise level, encouraging the model to cover both modes of the target.
  • Figure 3: Samples on MoG-40. We train each method for 2.5 hours, which allows all to converge. FAB and iDEM use replay buffers as in Midgley et al. (2023) and Akhound-Sadegh et al. (2024). The high-density regions of this target are within $[-50, 50]$. All methods were trained on the original scale, except for iDEM, which is normalized to $[-1, 1]$ following Akhound-Sadegh et al. (2024). This normalization may simplify the task.
  • Figure 4: 2D marginal (1st and 3rd dimensions) of samples from MW-32. Our approach and FAB find all the modes with correct weights, iDEM finds all modes but with wrong weights, and the neural sampler trained with standard KL divergence captures only one mode.
  • Figure 5: Left: Wasserstein-2 ($\mathcal{W}_2$) distance of samples and total variation distance (TVD) of energy on MW-32. Our method and FAB clearly outperform iDEM and KL in this evaluation. Right: Histogram of sample energy. Our approach outperforms both FAB and iDEM. Note that although the KL approach yields better energy, it captures only one mode, as shown in Figure 4.
  • ...and 4 more figures
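The effect described in the Figure 1 caption, where Gaussian convolution bridges modes and can reduce their number, can be reproduced with a small sketch. The target below is a hypothetical equal-weight 1D mixture of three Gaussians chosen for illustration, not the distribution plotted in the paper. Because convolving a Gaussian with a Gaussian adds variances, the smoothed density is available in closed form, and modes are counted as local maxima on a grid.

```python
import numpy as np

# Hypothetical 1D target: an equal-weight mixture of three unit-variance Gaussians.
MEANS = np.array([-10.0, 0.0, 10.0])
STD = 1.0

def smoothed_density(x, sigma):
    """Density of the target convolved with N(0, sigma^2), evaluated on the array x."""
    total_std = np.sqrt(STD**2 + sigma**2)  # variances add under Gaussian convolution
    z = (x[:, None] - MEANS) / total_std
    norm = len(MEANS) * total_std * np.sqrt(2.0 * np.pi)
    return np.exp(-0.5 * z**2).sum(axis=1) / norm

def count_modes(sigma, grid=None):
    """Count local maxima of the smoothed density on a uniform grid."""
    if grid is None:
        grid = np.linspace(-40.0, 40.0, 8001)
    f = smoothed_density(grid, sigma)
    # An interior grid point is a peak if it rises from the left and does not rise to the right.
    is_peak = (f[1:-1] > f[:-2]) & (f[1:-1] >= f[2:])
    return int(is_peak.sum())

for sigma in (0.0, 5.0, 10.0):
    print(f"sigma={sigma:4.1f}  modes={count_modes(sigma)}")
```

With no smoothing the three components are well separated, while at the largest noise level the smoothed density has a single mode, which is the regime in which a reverse KL term compares broad, overlapping densities.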

Theorems & Definitions (17)

  • Definition 3.1: Spread KL Divergence
  • Definition 3.2: Diffusive KL Divergence
  • Proposition 4.1: Denoising Score Identity
  • Proposition 4.2: Target Score Identity
  • Proposition 4.3: Mixed Score Identity
  • Proposition 6.1
  • proof
  • proof
  • proof
  • proof
  • ...and 7 more