Acceleration of Parallel Tempering for Markov Chain Monte Carlo methods
Aingeru Ramos, Jose A Pascual, Javier Navaridas, Ivan Coluzza
TL;DR
The paper tackles the challenge of efficiently sampling Boltzmann-distributed states in complex systems using Markov Chain Monte Carlo (MCMC) methods. It develops two parallel implementations of Metropolis-Hastings with Parallel Tempering, one with OpenMP for CPUs and one with CUDA for GPUs; the CUDA version keeps its data on the device to minimize transfers. Using a 2D Ising model as a benchmark, it reports substantial speed-ups (up to 52x with OpenMP on 48 cores and up to 986x with CUDA) and analyzes convergence behavior and swap overhead. The work provides a practical benchmark for future quantum MCMC approaches and suggests avenues for memory optimization and for extension to more complex models.
Abstract
Markov Chain Monte Carlo (MCMC) methods are algorithms for sampling probability distributions, and are commonly used to sample the Boltzmann distribution of physical and chemical models (e.g., protein folding or the Ising model). This allows us to study the properties of those systems by sampling their most probable states. However, these methods do not sample accurately enough when handling complex configuration spaces. This has led to the development of new techniques that improve sampling accuracy, usually at the expense of a higher computational cost. One such technique is Parallel Tempering (PT), which improves accuracy by running several replicas that periodically exchange their states. Computationally, this imposes a significant slow-down, which can be counteracted by means of parallelization. Parallelization schemes enable MCMC/PT techniques to run more efficiently and allow larger models to be studied. In this work, we present a parallel implementation of Metropolis-Hastings with Parallel Tempering, using OpenMP and CUDA for parallelization on modern CPUs and GPUs, respectively. The results show a maximum speed-up of 52x using OpenMP with 48 cores, and of 986x with the CUDA version. Furthermore, the results serve as a basic benchmark against which to compare a future quantum implementation of the same algorithm.
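
To make the replica-exchange idea concrete, the following is a minimal serial sketch of Metropolis-Hastings with Parallel Tempering on a small 2D Ising lattice. It is not the authors' OpenMP or CUDA code. The function names (`energy`, `metropolis_sweep`, `parallel_tempering`), the coupling J = 1, the periodic boundaries, the single-spin-flip proposal, and the neighbour-only swap schedule are all assumptions chosen for illustration.

```python
import math
import random


def energy(spins, n):
    """Total energy of an n x n Ising lattice (assumed J = 1, periodic boundaries)."""
    e = 0
    for i in range(n):
        for j in range(n):
            # Count each bond once via the right and down neighbours.
            e -= spins[i][j] * (spins[i][(j + 1) % n] + spins[(i + 1) % n][j])
    return e


def metropolis_sweep(spins, n, beta, rng):
    """One sweep (n*n proposals) of single-spin-flip Metropolis-Hastings."""
    for _ in range(n * n):
        i, j = rng.randrange(n), rng.randrange(n)
        neighbours = (spins[(i + 1) % n][j] + spins[(i - 1) % n][j]
                      + spins[i][(j + 1) % n] + spins[i][(j - 1) % n])
        d_e = 2 * spins[i][j] * neighbours  # energy change if this spin flips
        if d_e <= 0 or rng.random() < math.exp(-beta * d_e):
            spins[i][j] = -spins[i][j]


def parallel_tempering(n, betas, n_iters, swap_every, seed=0):
    """Run one replica per inverse temperature and periodically try neighbour swaps."""
    rng = random.Random(seed)
    replicas = [[[rng.choice((-1, 1)) for _ in range(n)] for _ in range(n)]
                for _ in betas]
    for it in range(1, n_iters + 1):
        # The sweeps of different replicas do not depend on each other, so this
        # loop is the natural target for parallelization (a parallel version
        # would need one random generator per replica instead of a shared one).
        for spins, beta in zip(replicas, betas):
            metropolis_sweep(spins, n, beta, rng)
        if it % swap_every == 0:
            for k in range(len(betas) - 1):
                e_k = energy(replicas[k], n)
                e_k1 = energy(replicas[k + 1], n)
                # Accept with probability min(1, exp((b_k - b_k1) * (E_k - E_k1))).
                delta = (betas[k] - betas[k + 1]) * (e_k - e_k1)
                if delta >= 0 or rng.random() < math.exp(delta):
                    replicas[k], replicas[k + 1] = replicas[k + 1], replicas[k]
    return replicas
```

A call such as `parallel_tempering(8, [0.2, 0.4, 0.6], n_iters=50, swap_every=5)` returns one lattice per inverse temperature. In this sketch the swap step recomputes both energies from scratch and runs serially, which hints at why swap overhead matters once the sweeps themselves are parallelized.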
