Enhancing Gradient-based Discrete Sampling via Parallel Tempering
Luxu Liang, Yuhang Jia, Feng Zhou
TL;DR
This work tackles gradient-based sampling in discrete, multimodal spaces, where local minima impede effective exploration. It introduces PT-DULA, a Parallel Tempering enhanced Discrete Langevin Proposal that runs multiple replicas at different temperatures and swaps adjacent chains under a tailored Metropolis criterion that preserves detailed balance. The authors provide asymptotic and non-asymptotic convergence analyses, show faster mixing than single-chain methods, and offer an automatic scheme for choosing the temperature schedule and the number of chains. Empirically, PT-DULA outperforms the compared baselines on synthetic MoG/MoS distributions, on sampling from RBMs, and on training deep energy-based models, with improved mode coverage and sample quality and with scalable mini-batch variants. These results position PT-DULA as a robust, dataset-adaptive tool for navigating complex discrete energy landscapes in both sampling and learning tasks.
Abstract
While gradient-based discrete samplers are effective in sampling from complex distributions, they are susceptible to getting trapped in local minima, particularly in high-dimensional, multimodal discrete distributions, owing to the discontinuities inherent in these landscapes. To circumvent this issue, we combine parallel tempering, also known as replica exchange, with the discrete Langevin proposal and develop the Parallel Tempering enhanced Discrete Langevin Proposal (PTDLP), in which multiple chains are simulated at a series of temperatures. Significant energy differences between chains prompt sample swaps, which are governed by a Metropolis criterion specifically designed for discrete sampling so that detailed balance is maintained. Additionally, we introduce an automatic scheme to determine the optimal temperature schedule and the number of chains, ensuring adaptability across diverse tasks with minimal tuning. Theoretically, we establish that our algorithm converges non-asymptotically to the target energy and mixes faster than a single chain. Empirical results further demonstrate the advantage of our method in sampling from complex, multimodal discrete distributions, including synthetic problems, restricted Boltzmann machines, and deep energy-based models.
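To make the combination of replica exchange and the discrete Langevin proposal concrete, the following is a minimal sketch, not the authors' implementation. It assumes binary states in {0,1}^d, a target of the form pi(x) proportional to exp(log_p(x)), tempered targets proportional to exp(log_p(x)/T), and an unadjusted per-coordinate flip proposal. The swap step uses the standard replica-exchange Metropolis ratio rather than the tailored criterion described in the abstract, and the temperature ladder is fixed by hand rather than chosen automatically. All function names, the toy bimodal target, and the parameter values are hypothetical.

```python
import numpy as np

def dlp_step(x, grad_log_p, temp, alpha, rng):
    """One unadjusted discrete Langevin step on x in {0,1}^d at temperature temp."""
    g = grad_log_p(x) / temp                       # gradient of the tempered log-density
    # Log-odds of flipping each coordinate (binary form of the Langevin proposal).
    logit = -0.5 * g * (2.0 * x - 1.0) - 1.0 / (2.0 * alpha)
    p_flip = 1.0 / (1.0 + np.exp(-logit))
    flip = rng.random(x.shape) < p_flip
    return np.where(flip, 1.0 - x, x)

def swap_log_ratio(lp_cold, lp_hot, t_cold, t_hot):
    """Log Metropolis ratio for exchanging the states of two tempered chains
    (standard replica-exchange rule, an assumption of this sketch)."""
    return (1.0 / t_cold - 1.0 / t_hot) * (lp_hot - lp_cold)

def pt_dlp(log_p, grad_log_p, x0, temps, alpha=0.2, n_steps=500,
           swap_every=10, seed=0):
    """Run one chain per temperature; return the states of the first chain."""
    rng = np.random.default_rng(seed)
    chains = [x0.copy() for _ in temps]
    samples = []
    for step in range(n_steps):
        # Local move: each replica takes a Langevin step at its own temperature.
        chains = [dlp_step(x, grad_log_p, t, alpha, rng)
                  for x, t in zip(chains, temps)]
        # Exchange move: propose swaps between adjacent temperatures.
        if (step + 1) % swap_every == 0:
            for k in range(len(temps) - 1):
                log_r = swap_log_ratio(log_p(chains[k]), log_p(chains[k + 1]),
                                       temps[k], temps[k + 1])
                if rng.random() < np.exp(min(0.0, log_r)):
                    chains[k], chains[k + 1] = chains[k + 1], chains[k]
        samples.append(chains[0].copy())
    return np.array(samples)

# Hypothetical toy target with two modes (all zeros and all ones).
d = 8

def log_p(x):
    s = np.sum(2.0 * x - 1.0)
    return 0.5 * s ** 2 / d

def grad_log_p(x):
    s = np.sum(2.0 * x - 1.0)
    return np.full_like(x, 2.0 * s / d)

samples = pt_dlp(log_p, grad_log_p, np.zeros(d), temps=[1.0, 2.0, 4.0])
```

Here `temps[0] = 1.0` corresponds to the target itself, so only the first chain's states are collected. The hotter chains see a flattened landscape and cross between modes more readily, and the swaps pass those crossings down to the target chain.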
