Adaptive Multidimensional Quadrature on Multi-GPU Systems

Melanie Tonarelli, Simone Riva, Pietro Benedusi, Fabrizio Ferrandi, Rolf Krause

TL;DR

The paper tackles high-dimensional deterministic integration on distributed GPUs, where adaptivity induces severe load imbalance. It introduces a distributed adaptive quadrature framework with a decentralised, round-robin load redistribution implemented via CUDA-aware MPI to balance subdomain workloads while refining many regions per iteration. The approach extends single-GPU methods to multi-GPU systems by over-partitioning the domain, maintaining in-device data layouts, and exchanging compact progress metadata to ensure convergence. Empirical results show competitive performance against a state-of-the-art CPU/GPU framework, with feasibility up to $d=11$ and robustness to oscillatory and discontinuous integrands, highlighting the method’s practical value for high-dimensional numerical integration.
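The idea of refining many regions per iteration can be illustrated with a small, CPU-only sketch. It is not the paper's solver: the midpoint rule, the two-level error estimate (difference between a region's estimate and the sum over its $2^d$ children), the volume-weighted error budget, and the names `midpoint`, `split`, and `adaptive_integrate` are all assumptions chosen for brevity.

```python
import math
from itertools import product


def midpoint(f, lo, hi):
    """Midpoint-rule estimate of the integral of f over the box [lo, hi]."""
    centre = [(a + b) / 2 for a, b in zip(lo, hi)]
    volume = math.prod(b - a for a, b in zip(lo, hi))
    return f(centre) * volume


def split(lo, hi):
    """Halve every axis, giving the 2^d equal children of the box [lo, hi]."""
    mid = [(a + b) / 2 for a, b in zip(lo, hi)]
    kids = []
    for corner in product((0, 1), repeat=len(lo)):
        kid_lo = [m if bit else a for a, m, bit in zip(lo, mid, corner)]
        kid_hi = [b if bit else m for b, m, bit in zip(hi, mid, corner)]
        kids.append((kid_lo, kid_hi))
    return kids


def adaptive_integrate(f, lo, hi, tol_rel=1e-3, max_iter=10):
    """Return (estimate, error_estimate, converged) for f over [lo, hi]."""
    domain_volume = math.prod(b - a for a, b in zip(lo, hi))
    active = [(list(lo), list(hi))]
    done_val = done_err = 0.0
    total = total_err = 0.0
    for _ in range(max_iter):
        # Evaluate every active region in the same sweep (assumed estimator).
        results = []
        for r_lo, r_hi in active:
            kids = split(r_lo, r_hi)
            coarse = midpoint(f, r_lo, r_hi)
            fine = sum(midpoint(f, k_lo, k_hi) for k_lo, k_hi in kids)
            volume = math.prod(b - a for a, b in zip(r_lo, r_hi))
            results.append((fine, abs(fine - coarse), volume, kids))
        total = done_val + sum(r[0] for r in results)
        total_err = done_err + sum(r[1] for r in results)
        if total_err <= tol_rel * abs(total):
            return total, total_err, True
        # Finalise regions within their volume-weighted budget; split the rest.
        active = []
        for val, err, volume, kids in results:
            if err <= tol_rel * abs(total) * volume / domain_volume:
                done_val += val
                done_err += err
            else:
                active.extend(kids)
        if not active:
            break
    return total, total_err, False
```

Because each region's work depends only on its own bounds, the inner loop over `active` is the part that a GPU implementation would run in parallel, and the uneven survival of regions from one sweep to the next is the source of the load imbalance described above.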

Abstract

We introduce a distributed adaptive quadrature method that formulates multidimensional integration as a hierarchical domain decomposition problem on multi-GPU architectures. The integration domain is recursively partitioned into subdomains whose refinement is guided by local error estimators. Each subdomain evolves independently on a GPU, which exposes a significant load imbalance as the adaptive process progresses. To address this challenge, we introduce a decentralised load redistribution scheme based on a cyclic round-robin policy. This strategy dynamically rebalances subdomains across devices through non-blocking, CUDA-aware MPI communication that overlaps with computation. The proposed strategy has two main advantages over a state-of-the-art GPU-tailored package: higher efficiency in high dimensions, and improved robustness with respect to the integrand regularity and the target accuracy.
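One possible reading of a cyclic round-robin redistribution policy is sketched below as a single-process simulation. The abstract does not spell out the protocol, so the pairing rule (rank `i` offers work to rank `(i + shift) % P`, with the shift cycling through `1..P-1`), the choice to move half of the surplus, and the names `round_robin_step` and `rebalance` are assumptions. Plain list updates stand in for the non-blocking, CUDA-aware MPI transfers.

```python
def round_robin_step(loads, shift):
    """One redistribution step: rank i offers work to rank (i + shift) % P.

    `loads[i]` is the number of active subdomains held by rank i. All ranks
    decide from the same snapshot, mimicking a decentralised exchange in
    which each rank only needs its partner's load (hypothetical rule).
    """
    p = len(loads)
    snapshot = list(loads)
    new = list(loads)
    for src in range(p):
        dst = (src + shift) % p
        surplus = snapshot[src] - snapshot[dst]
        if surplus > 1:
            moved = surplus // 2  # send half of the surplus to the partner
            new[src] -= moved
            new[dst] += moved
    return new


def rebalance(loads, rounds):
    """Apply `rounds` steps, cycling the partner shift through 1..P-1."""
    p = len(loads)
    loads = list(loads)
    if p < 2:
        return loads
    for k in range(rounds):
        shift = 1 + k % (p - 1)
        loads = round_robin_step(loads, shift)
    return loads


print(rebalance([64, 0, 0, 0], 2))  # -> [16, 16, 16, 16]
```

No rank acts as a coordinator in this sketch: every rank pairs with one partner per step, the total number of subdomains is conserved, and the spread between the most and least loaded rank never grows.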

Paper Structure

This paper contains 5 sections, 2 equations, 4 figures.

Figures (4)

  • Figure 1: Workflows of the single- and multi-GPU solvers.
  • Figure 2: Comparison between GM and PAGANI as a function of the prescribed tolerance $\tau_{\text{rel}} = 10^{-k}$ ($x$-axis showing $k$) on a single GPU.
  • Figure 3: Comparison between PAGANI on a single GPU and our strategy (GM) on two GPUs. (a) Feasibility comparison for test functions $f_1$ and $f_5$ across different dimensions. The bars indicate the strictest relative tolerances at which convergence was achieved. (b) Speedup of GM w.r.t. PAGANI.
  • Figure 4: Performance of the multi-GPU solver under the round-robin policy as a function of the number of ranks. (a) Strong scaling for $f_2$ and $f_6$ in $d=6$. (b) Computation and idle time fractions for $f_3$ and $f_6$ in $d=6$ at $\tau_{\text{rel}} = 10^{-8}$.