cuVegas: Accelerate Multidimensional Monte Carlo Integration through a Parallelized CUDA-based Implementation of the VEGAS Enhanced Algorithm

Emiliano Tolotti; Anas Jnini; Flavio Vella; Roberto Passerone

TL;DR

cuVegas presents a CUDA-based implementation of the VEGAS+ adaptive multidimensional Monte Carlo algorithm, maximizing GPU parallelism through a batch-oriented evaluation scheme, on-GPU map updates, and multi-GPU data sharing. It integrates adaptive importance sampling and adaptive stratified sampling with estimation aggregation, delivering substantial speedups over CPU VEGAS and competing GPU frameworks, especially for integrands with multiple peaks or diagonal structures. The paper provides extensive performance analyses, including benchmarks on Asian option pricing and Feynman path integrals, and demonstrates strong multi-GPU scalability with careful memory and RNG optimizations. The work shows that VEGAS+ on GPUs can achieve practical, real-world improvements in accuracy-time tradeoffs for high-dimensional integration tasks, with a usable Python binding for integration into scientific workflows.
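The "estimation aggregation" step mentioned above can be illustrated with the inverse-variance weighted average that VEGAS-style integrators conventionally use to combine per-iteration results. The following is a minimal sketch under that assumption; the function name `aggregate_estimates` and the sample numbers are hypothetical and are not taken from cuVegas.

```python
import math


def aggregate_estimates(estimates, variances):
    """Combine per-iteration estimates using inverse-variance weights.

    Assumed form (standard for VEGAS-style integrators, not confirmed
    for cuVegas): I = sum(I_j / s_j^2) / sum(1 / s_j^2).
    """
    if not estimates or len(estimates) != len(variances):
        raise ValueError("need one positive variance per estimate")
    weights = [1.0 / v for v in variances]
    total_weight = sum(weights)
    mean = sum(w * e for w, e in zip(weights, estimates)) / total_weight
    sigma = math.sqrt(1.0 / total_weight)  # standard error of the combined estimate
    return mean, sigma


# Two hypothetical iterations: the lower-variance one dominates the result.
combined_mean, combined_sigma = aggregate_estimates([1.0, 2.0], [1.0, 4.0])
```

Iterations with smaller variance receive larger weight, so later iterations, which run on a better-adapted grid, typically contribute most to the final value.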

Abstract

This paper introduces cuVegas, a CUDA-based implementation of the VEGAS Enhanced Algorithm (VEGAS+), optimized for multi-dimensional integration in GPU environments. The VEGAS+ algorithm is an advanced form of Monte Carlo integration, recognized for its adaptability and effectiveness in handling complex, high-dimensional integrands. It combines two variance reduction techniques, adaptive importance sampling and a variant of adaptive stratified sampling, which make it particularly adept at handling integrands with multiple peaks or with structures aligned with the diagonals of the integration volume. As a Monte Carlo integration method, VEGAS+ is well suited to parallelization and to GPU execution. Our implementation, cuVegas, aims to harness the inherent parallelism of GPUs, addressing the challenge of workload distribution that often hampers efficiency in standard implementations. We present a comprehensive analysis comparing cuVegas with existing CPU and GPU implementations, demonstrating significant performance improvements: two to three orders of magnitude over CPU implementations, and at least a factor of two over the best existing GPU implementation. We also demonstrate the speedup on the class of integrands for which VEGAS+ was designed, namely those with multiple peaks or other significant structures aligned with diagonals of the integration volume.
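To make the adaptive importance sampling idea concrete, here is a simplified one-dimensional CPU sketch in plain Python. It is not the cuVegas implementation: the names `refine_grid`, `integrate_1d` and `peak`, the parameter values, and the damping rule (a plain power `alpha` with a small floor, instead of the smoothing used by VEGAS) are all assumptions made for illustration. Each iteration samples through a piecewise-linear map, accumulates the squared weighted integrand per bin, and then moves the bin edges so that narrow bins gather where the integrand is large.

```python
import math
import random


def refine_grid(edges, weights):
    """Move bin edges so that every new bin holds an equal share of `weights`."""
    n_bins = len(weights)
    target = sum(weights) / n_bins
    new_edges = [edges[0]]
    acc = 0.0  # weight of the old bins fully consumed so far
    j = 0      # index of the old bin currently being split
    for k in range(1, n_bins):
        goal = k * target
        while j < n_bins - 1 and acc + weights[j] < goal:
            acc += weights[j]
            j += 1
        frac = min(max((goal - acc) / weights[j], 0.0), 1.0)
        new_edges.append(edges[j] + frac * (edges[j + 1] - edges[j]))
    new_edges.append(edges[-1])
    return new_edges


def integrate_1d(f, n_bins=50, n_iter=10, n_samples=4000, alpha=0.5, seed=0):
    """Integrate f over [0, 1] with a grid that adapts after each iteration."""
    rng = random.Random(seed)
    edges = [i / n_bins for i in range(n_bins + 1)]  # start from a uniform grid
    results = []
    for _ in range(n_iter):
        sum_w = sum_w2 = 0.0
        bin_acc = [0.0] * n_bins
        for _ in range(n_samples):
            pos = rng.random() * n_bins
            i = min(int(pos), n_bins - 1)
            width = edges[i + 1] - edges[i]
            x = edges[i] + (pos - i) * width   # map uniform y to x
            w = n_bins * width * f(x)          # Jacobian times integrand
            sum_w += w
            sum_w2 += w * w
            bin_acc[i] += w * w
        mean = sum_w / n_samples
        var = max((sum_w2 / n_samples - mean * mean) / (n_samples - 1), 0.0)
        results.append((mean, var))            # estimate and variance of the mean
        total = sum(bin_acc)
        if total > 0.0:
            # Simplified damping (assumption): floor plus power alpha.
            damped = [(d / total + 1e-6) ** alpha for d in bin_acc]
            edges = refine_grid(edges, damped)
    return results, edges


def peak(x, s=0.01):
    """Narrow normalized Gaussian centered at 0.5; its integral over [0, 1] is close to 1."""
    return math.exp(-((x - 0.5) ** 2) / (2.0 * s * s)) / (s * math.sqrt(2.0 * math.pi))


results, edges = integrate_1d(peak)
final_mean, final_var = results[-1]
```

The inner sampling loop is the part that is independent across samples, which is why this kind of method maps naturally onto GPU threads; the grid refinement is the step that requires the accumulated per-bin data.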

Paper Structure (31 sections, 11 equations, 8 figures, 10 tables, 2 algorithms)

Introduction
Background and Methodology
The VEGAS Enhanced Algorithm
Adaptive Importance Sampling
Adaptive Stratified Sampling
Estimation Aggregation
Related Work
Parallelization challenges
CUDA Implementation of the VEGAS Enhanced Algorithm
Parallelization strategy
Algorithm Overview and Pseudocode
Implementation details and optimizations
Program time analysis
Optimization of Random Number Generation
Accumulation and Data Reduction Techniques
...and 16 more sections

Figures (8)

Figure 1: Parallelization diagram of the program in a single-GPU setting.
Figure 2: Parallelization diagram of the program in a multi-GPU setting.
Figure 3: VegasFill kernel performance scaling, changing algorithm parameters. The testing parameters are reported in Table \ref{tab:scaling}. Blue dots represent mean execution time values of the total kernel calls in the program and bars represent standard error. Orange and light blue dots represent minimum and maximum execution time respectively.
Figure 4: Performance comparison of cuVegas, Vegas, TorchQuad and VegasFlow across seven test functions. On the $y$-axis the average wall-clock time is plotted against the average relative standard error on the $x$-axis. Axes are in log-scale. Lines represent the geometric mean over the seven integrands.
Figure 5: Speedup of multiple GPUs with respect to the single GPU version for the Ridge integrand, varying the number of function evaluations.
...and 3 more figures
