Enhancing Gradient-based Discrete Sampling via Parallel Tempering

Luxu Liang, Yuhang Jia, Feng Zhou

TL;DR

This work tackles gradient-based sampling in discrete, multimodal spaces where local minima impede effective exploration. It introduces PT-DULA, a Parallel Tempering enhanced Discrete Langevin Proposal that runs multiple replicas at a series of temperatures and swaps adjacent chains under a tailored Metropolis criterion that preserves detailed balance. The authors provide asymptotic and non-asymptotic convergence analyses, show faster mixing than single-chain methods, and offer an automatic scheme for choosing the temperature schedule and the number of chains. Empirically, PT-DULA outperforms strong baselines on synthetic MoG/MoS distributions, on sampling from RBMs, and on training deep energy-based models, with improved mode coverage and sample quality and with scalable mini-batch variants. These results position PT-DULA as a robust, dataset-adaptive tool for navigating complex discrete energy landscapes in both sampling and learning tasks.

Abstract

While gradient-based discrete samplers are effective in sampling from complex distributions, they are susceptible to getting trapped in local minima, particularly in high-dimensional, multimodal discrete distributions, owing to the discontinuities inherent in these landscapes. To circumvent this issue, we combine parallel tempering, also known as replica exchange, with the discrete Langevin proposal and develop the Parallel Tempering enhanced Discrete Langevin Proposal (PTDLP), in which multiple chains are simulated at a series of temperatures. Significant energy differences prompt sample swaps, which are governed by a Metropolis criterion specifically designed for discrete sampling to ensure that detailed balance is maintained. Additionally, we introduce an automatic scheme to determine the optimal temperature schedule and the number of chains, ensuring adaptability across diverse tasks with minimal tuning. Theoretically, we establish that our algorithm converges non-asymptotically to the target energy and exhibits faster mixing compared to a single chain. Empirical results further emphasize the superiority of our method in sampling from complex, multimodal discrete distributions, including synthetic problems, restricted Boltzmann machines, and deep energy-based models.
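
To make the replica-exchange idea concrete, the following is a minimal sketch, not the paper's algorithm. It assumes the standard parallel tempering swap rule, accepting with probability min(1, exp((β_a − β_b)(E_a − E_b))), which may differ from the tailored discrete criterion the abstract refers to. A plain random-walk Metropolis step stands in for the discrete Langevin proposal. The energy function, the state space {0, ..., 39}, the inverse-temperature ladder, and all function names are hypothetical choices made for illustration.

```python
import math
import random


def energy(x: int) -> float:
    """Hypothetical bimodal energy on {0, ..., 39} with wells at 5 and 35."""
    return 0.5 * min((x - 5) ** 2, (x - 35) ** 2)


def swap_accept_prob(beta_a: float, beta_b: float,
                     energy_a: float, energy_b: float) -> float:
    """Standard replica-exchange acceptance for swapping two chain states.

    Chain a targets exp(-beta_a * E) and chain b targets exp(-beta_b * E).
    This is the textbook rule, assumed here; the paper's criterion may differ.
    """
    log_ratio = (beta_a - beta_b) * (energy_a - energy_b)
    return 1.0 if log_ratio >= 0.0 else math.exp(log_ratio)


def local_step(x: int, beta: float, rng: random.Random, n_states: int = 40) -> int:
    """Random-walk Metropolis step (a stand-in for a gradient-based proposal)."""
    y = x + rng.choice((-1, 1))
    if not 0 <= y < n_states:
        return x  # reject proposals that leave the state space
    log_accept = min(0.0, -beta * (energy(y) - energy(x)))
    return y if rng.random() < math.exp(log_accept) else x


def parallel_tempering(betas, n_iters: int, seed: int = 0):
    """Run one replica per inverse temperature; return the coldest chain's states."""
    rng = random.Random(seed)
    states = [5] * len(betas)  # every replica starts in the left well
    cold_samples = []
    for _ in range(n_iters):
        # Local exploration within each temperature level.
        states = [local_step(x, b, rng) for x, b in zip(states, betas)]
        # Propose a swap between one randomly chosen adjacent pair.
        k = rng.randrange(len(betas) - 1)
        p = swap_accept_prob(betas[k], betas[k + 1],
                             energy(states[k]), energy(states[k + 1]))
        if rng.random() < p:
            states[k], states[k + 1] = states[k + 1], states[k]
        cold_samples.append(states[0])
    return cold_samples


if __name__ == "__main__":
    samples = parallel_tempering([1.0, 0.3, 0.1, 0.03], n_iters=20000)
    print("fraction of cold-chain samples in the right well:",
          sum(s > 20 for s in samples) / len(samples))
```

In this sketch, the high-temperature replicas cross the energy barrier easily, and swaps can pass those states down to the cold chain; a single cold chain started at the left well would have to climb the barrier on its own.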

Paper Structure

This paper contains 55 sections, 8 theorems, 71 equations, 5 figures, 5 tables, 3 algorithms.

Key Result

Lemma 4.1

$\tau_{\mathcal{B}}(\cdot)$ is optimized when we run $\mathcal{B}^*= \lfloor K_{\text{total}} / K^* \rfloor$ copies of PT with $K^{\ast}=2 \Lambda + 1$.
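
As a worked illustration of the arithmetic in Lemma 4.1, the sketch below computes $K^*$ and $\mathcal{B}^*$ from a total chain budget. It treats $\Lambda$ as a known positive number supplied by the caller, since this summary does not define it. Rounding $2\Lambda + 1$ to an integer and requiring at least two chains per copy are assumptions of this sketch, and the function and argument names are hypothetical.

```python
from typing import Tuple


def pt_allocation(k_total: int, lam: float) -> Tuple[int, int]:
    """Return (K_star, B_star) following the formula stated in Lemma 4.1.

    k_total: total number of chains available (K_total in the lemma).
    lam:     the quantity Lambda from the lemma, assumed to be given.
    """
    # K* = 2 * Lambda + 1; rounding to an integer is an assumption made here.
    k_star = max(2, round(2 * lam + 1))
    # B* = floor(K_total / K*): number of independent PT copies.
    b_star = k_total // k_star
    return k_star, b_star


if __name__ == "__main__":
    k_star, b_star = pt_allocation(k_total=32, lam=3.0)
    print(f"chains per PT copy: {k_star}, number of PT copies: {b_star}")
```

With a budget of 32 chains and $\Lambda = 3$, this gives 7 chains per copy and 4 copies, leaving 4 chains of the budget unused.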

Figures (5)

  • Figure 1: The blue, green, and red dots correspond to probability functions at three temperatures. The high-probability areas to sample from are indicated by dashed lines.
  • Figure 2: Sampling performance (measured by EMC) of various methods for MoG (left) and MoS (right) with varying numbers of components. Sampling performance across iterations for 8 Gaussians and 16 Gaussians. PT-DMALA consistently outperforms baselines across random seeds.
  • Figure 3: RBM sampling results with local mode initialization. PT-DMALA converges faster, while baseline methods converge more slowly because they remain trapped in the initial mode.
  • Figure 4: Images sampled from an RBM trained on MNIST when the sampler is initialized to the most likely mode. Our algorithm generates a diverse range of digits, demonstrating its ability to escape from modes.
  • Figure 5: The images in the top row are examples from the dataset, while those in the bottom row are samples from the trained EBM. The images generated by our algorithm are similar to those from the dataset, demonstrating that the model is capable of generating high-quality samples.

Theorems & Definitions (14)

  • Lemma 4.1
  • Theorem 5.1
  • Theorem 5.2
  • Theorem 5.3
  • Corollary 5.4
  • Lemma D.1 (nadler2007dynamics; syed2022non)
  • proof: Proof of \ref{lemma_4_2}
  • Lemma D.2
  • proof
  • proof: Proof of \ref{thm_5_1}
  • ...and 4 more