Optimal parallelisation strategies for flat histogram Monte Carlo sampling

Hubert J. Naguszewski, Christopher D. Woodgate, David Quigley

TL;DR

The paper addresses how to efficiently parallelise flat-histogram Monte Carlo methods, specifically Wang–Landau (WL) sampling, to compute phase behaviour in lattice models. It benchmarks multiple parallelisation strategies (non-uniform energy-domain decomposition, dynamic load balancing, replica exchange, and varying numbers of walkers per sub-domain) using a fixed-lattice model of the AlTiCrMo high-entropy alloy and observables derived from the density of states (DOS). The key finding is that non-uniform energy-domain decomposition yields the largest speedups, with dynamic load balancing providing additional, though smaller, gains; replica exchange largely leaves efficiency unchanged, and using 1–2 walkers per sub-domain is typically sufficient. The study offers concrete, actionable recommendations for accelerating WL simulations in materials science and similar flat-histogram frameworks, enabling higher-throughput exploration of phase diagrams and thermodynamics.

Abstract

Flat histogram methods, such as Wang-Landau sampling, provide a means for high-throughput calculation of phase diagrams of atomistic/lattice model systems. Many parallelisation schemes with varying degrees of complexity have been proposed to accelerate such sampling simulations. In this study, several widely used schemes are benchmarked, both in isolation and in combination, to establish best practice. The schemes studied include energy domain decomposition with static sizing of energy sub-domains, as well as with a dynamic sub-domain sizing scheme which we propose. We also assess the benefits both of replica exchange and of including multiple random walkers per sub-domain, to determine which factors have the largest impact on parallel efficiency. Additionally, the influence of energy sub-domain overlap regions is discussed. As an illustrative test case, we implement and apply the aforementioned strategies to a lattice-based model describing the internal energies of the AlTiCrMo refractory high-entropy superalloy, which is understood to crystallographically order into a B2 (CsCl) structure with decreasing temperature. We find that, while all of the proposed strategies confer a non-negligible speedup, parallelisation across energy domains which are non-uniform in size offers the most appreciable performance improvements. This work offers concrete recommendations for which parallelisation strategies should be prioritised to optimally accelerate flat-histogram Monte Carlo simulations.
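
To make the idea of statically sized, overlapping energy sub-domains concrete, the following is a minimal sketch of the uniform case. It is not the authors' implementation: the function name `uniform_windows` is hypothetical, and the overlap convention (each neighbouring pair of windows shares a fraction `overlap` of one window's width) is an assumption, since the abstract does not define how the overlap percentage is measured.

```python
from typing import List, Tuple


def uniform_windows(
    e_min: float, e_max: float, n_windows: int, overlap: float
) -> List[Tuple[float, float]]:
    """Split [e_min, e_max] into equal-width, overlapping energy windows.

    Assumed convention (not taken from the paper): consecutive windows
    share a fraction `overlap` of a single window's width.
    """
    if n_windows < 1:
        raise ValueError("n_windows must be at least 1")
    if not 0.0 <= overlap < 1.0:
        raise ValueError("overlap must lie in [0, 1)")
    if e_max <= e_min:
        raise ValueError("e_max must be greater than e_min")

    # n windows of width w, each shifted by w * (1 - overlap), cover a span of
    # w * (1 + (n - 1) * (1 - overlap)); solve that for w.
    width = (e_max - e_min) / (1 + (n_windows - 1) * (1 - overlap))
    step = width * (1 - overlap)

    windows = [
        (e_min + i * step, e_min + i * step + width) for i in range(n_windows)
    ]
    # Pin the final upper edge to e_max to remove floating-point drift.
    windows[-1] = (windows[-1][0], e_max)
    return windows


if __name__ == "__main__":
    for lower, upper in uniform_windows(0.0, 10.0, 4, 0.25):
        print(f"[{lower:.3f}, {upper:.3f}]")
```

A non-uniform decomposition would replace the single `width` with a per-window width, for example narrower windows where convergence is slowest; how those widths are chosen is the subject of the paper and is not reproduced here.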

Paper Structure

This paper contains 18 sections, 19 equations, 9 figures, 1 table.

Figures (9)

  • Figure 1: Conceptual illustrations of the schemes discussed in this work for parallelising and/or accelerating parallel Wang-Landau sampling implementations. Panel (a) illustrates uniform energy domain decomposition, where the energy domain $[E_{\mathrm{min}}, E_{\mathrm{max}}]$ is evenly partitioned into sub-domains, with fixed percentage overlapping regions (shaded). Panel (b) illustrates non-uniform energy domain decomposition, where the energy domain $[E_{\mathrm{min}}, E_{\mathrm{max}}]$ is partitioned into non-uniform sub-domains, with fixed percentage overlapping regions (shaded). Panel (c) illustrates replica exchange, where independent walkers sampling within overlap regions can (occasionally) exchange configurations with neighbouring sub-domains (illustrated by red/blue particles), which allows for crossing configuration barriers. Finally, panel (d) represents dynamic load balancing, where energy sub-domains are adaptively adjusted after each Wang-Landau iteration based on the time taken to converge each sub-domain.
  • Figure 2: Comparison of the speedup per WL instance for different choices of energy sub-domain overlap, taken relative to the case of a single walker across the entire energy domain. This plot is for simulations using Method 3 with a single walker per energy sub-domain. The solid red line shows the number of energy sub-domains, i.e., the number of WL instances ($h$), while the dotted red line shows the square of that number ($h^2$). It can be seen that, for all choices of energy sub-domain overlap region size (excluding the 75% case), the speedup is significantly above 100%, and that the size of the overlap region between energy sub-domains has a noticeable impact on speedup.
  • Figure 3: Normalised root mean square error (RMSE) as a function of the percentage overlap. The RMSE was obtained by comparing 23-iteration sampling runs with 16 windows to a well-converged 36-iteration sampling run with 1 window. The error bars correspond to the standard deviation of time taken of 5 repeat WL simulations. For each number of overlap bins, 5 separate 23-iteration simulations were performed and averaged. These data show that there is no correlation between the size of the energy sub-domain overlap region and the accuracy of the obtained global DOS up to a 50% overlap. Beyond 50% overlap, the averaging power of the sampling is reduced, hence the increase in normalised RMSE, though not to a statistically significant extent.
  • Figure 4: Comparison of the speedup per WL instance for different numbers of walkers per energy sub-domain, taken relative to the case of a single walker across the entire energy domain. This plot is for Method 3 with a 25% energy sub-domain overlap. The error bars correspond to the standard deviation of time taken of 5 repeat WL simulations with different seeds on the PRNG. These data show that the optimal choices for efficiency are the 1 and 2 walker cases, with the 2 walker case marginally increasing the efficiency of each sub-domain. Additional walkers beyond this appear to confer no further benefit.
  • Figure 5: Comparison of the walker efficiency per WL instance for different numbers of walkers per energy sub-domain, taken relative to the case of a single walker across the entire energy domain. This plot is for Method 3 with a 25% energy sub-domain overlap. The error bars correspond to the standard deviation of time taken of 5 repeat WL simulations with different seeds on the PRNG. The plot shows the efficiency of each walker introduced across 3 methods, with the 1 and 2 walker cases displaying the highest maximum efficiency per Wang-Landau instance.
  • ...and 4 more figures
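
The replica exchange step described for panel (c) of Figure 1 can be sketched as follows. The captions above do not state which acceptance rule the authors use, so this sketch assumes the criterion commonly used in the replica-exchange Wang-Landau literature: a swap of configurations $X$ (held in sub-domain $i$) and $Y$ (held in sub-domain $j$) is accepted with probability $\min\left[1, \frac{g_i(E_X)\, g_j(E_Y)}{g_i(E_Y)\, g_j(E_X)}\right]$, where $g_i$ is the current DOS estimate of sub-domain $i$. The function names and the dictionary-based storage of $\ln g$ are hypothetical choices made for illustration.

```python
import math
import random
from typing import Mapping


def exchange_log_ratio(
    ln_g_i: Mapping[int, float],
    ln_g_j: Mapping[int, float],
    bin_x: int,
    bin_y: int,
) -> float:
    """Return ln[ g_i(E_X) g_j(E_Y) / (g_i(E_Y) g_j(E_X)) ].

    ln_g_i and ln_g_j map global energy-bin indices to the current ln g
    estimates of two neighbouring sub-domains. Both bins must lie in the
    overlap region, i.e. be present in both mappings.
    """
    return (ln_g_i[bin_x] - ln_g_i[bin_y]) + (ln_g_j[bin_y] - ln_g_j[bin_x])


def accept_exchange(
    ln_g_i: Mapping[int, float],
    ln_g_j: Mapping[int, float],
    bin_x: int,
    bin_y: int,
    rng: random.Random,
) -> bool:
    """Metropolis-style decision on swapping configurations X and Y."""
    if bin_x not in ln_g_j or bin_y not in ln_g_i:
        # A configuration would leave its new sub-domain: reject outright.
        return False
    log_ratio = exchange_log_ratio(ln_g_i, ln_g_j, bin_x, bin_y)
    # Working in log space avoids overflow, since g can span many orders
    # of magnitude.
    return log_ratio >= 0.0 or rng.random() < math.exp(log_ratio)


if __name__ == "__main__":
    rng = random.Random(0)
    ln_g_left = {3: 1.0, 4: 2.0}   # toy ln g values in the overlap bins
    ln_g_right = {3: 0.5, 4: 3.0}
    print(accept_exchange(ln_g_left, ln_g_right, 3, 4, rng))
```

Because the rule depends only on the current DOS estimates in the overlap bins, an exchange attempt is cheap compared with the sampling itself.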