Multi-GPU Acceleration of PALABOS Fluid Solver using C++ Standard Parallelism

Jonas Latt, Christophe Coreixas

TL;DR

This work addresses accelerating Palabos, a lattice Boltzmann CFD library, on GPUs while preserving its CPU-oriented API. It introduces AcceleratedLattice, a data-oriented GPU container that complements the existing MultiBlockLattice, enabling hybrid CPU/GPU execution and incremental porting through a push-pull data flow and per-cell model tagging that avoids virtual function calls. The authors validate the GPU port on Taylor-Green vortex, lid-driven cavity, and Berea sandstone benchmarks, achieving near-peak single-GPU performance and good weak/strong scaling across multiple GPUs, with careful attention to memory bandwidth and CUDA-aware MPI halo exchanges. The approach emphasizes portability and ease of use, relying solely on ISO C++ standard parallelism while keeping code duplication minimal, and lays the groundwork for further overlap of communication and computation and for expansion to additional accelerators.

Abstract

This article presents the principles, software architecture, and performance analysis of the GPU port of the lattice Boltzmann software library Palabos (J. Latt et al., "Palabos: Parallel lattice Boltzmann solver", Comput. Math. Appl. 81, 334-350, (2021)). A hybrid CPU-GPU execution model is adopted, in which numerical components are selectively assigned to either the CPU or the GPU, depending on considerations of performance or convenience. This design enables a progressive porting strategy, allowing most features of the original CPU-based codebase to be gradually and seamlessly adapted to GPU execution. The new architecture builds upon two complementary paradigms: a classical object-oriented structure for CPU execution, and a data-oriented counterpart for GPUs, which reproduces the modularity of the original code while eliminating object-oriented overhead detrimental to GPU performance. Central to this approach is the use of modern C++, including standard parallel algorithms and template metaprogramming techniques, which permit the generation of hardware-agnostic computational kernels. This facilitates the development of user-defined, GPU-accelerated components such as collision operators or boundary conditions, while preserving compatibility with the existing codebase and avoiding the need for external libraries or non-standard language extensions. The correctness and performance of the GPU-enabled Palabos are demonstrated through a series of three-dimensional multiphysics benchmarks, including the laminar-turbulent transition in a Taylor-Green vortex, lid-driven cavity flow, and pore-scale flow in Berea sandstone. Despite the high-level abstraction of the implementation, the single-GPU performance is similar to CUDA-native solvers, and multi-GPU tests exhibit good weak and strong scaling across all test cases.
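
To make the data-oriented idea in the abstract more concrete, here is a minimal sketch of a per-cell kernel driven by a standard parallel algorithm, where an integer tag stored for each cell selects the numerical model through a `switch` rather than a virtual call. It is not taken from Palabos: the names `ToyLattice`, `ModelTag`, `relaxCell`, and `collide` are hypothetical, the constant `Q = 4` is arbitrary, and the "relax toward the cell average" rule is only a stand-in for a real collision operator.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <execution>
#include <numeric>
#include <vector>

// Hypothetical model tags (illustrative names, not the Palabos identifiers).
enum class ModelTag : std::uint8_t { Relax = 0, Frozen = 1 };

// Toy number of values stored per cell (assumption for this sketch).
constexpr std::size_t Q = 4;

// Hypothetical flat container: plain arrays instead of polymorphic cell objects.
struct ToyLattice {
    std::size_t numCells;
    std::vector<double> f;       // values stored cell by cell: f[cell * Q + i]
    std::vector<ModelTag> tags;  // one integer tag per cell selects the model
};

// Stand-in "collision": relax each value toward the cell average with rate omega.
inline void relaxCell(double* cell, double omega) {
    double mean = 0.0;
    for (std::size_t i = 0; i < Q; ++i) mean += cell[i];
    mean /= static_cast<double>(Q);
    for (std::size_t i = 0; i < Q; ++i) cell[i] += omega * (mean - cell[i]);
}

// Apply the per-cell model to all cells with an ISO C++ parallel algorithm.
void collide(ToyLattice& lattice, double omega) {
    assert(lattice.f.size() == lattice.numCells * Q);
    assert(lattice.tags.size() == lattice.numCells);

    // Index range to iterate over; each index is processed independently.
    std::vector<std::size_t> cells(lattice.numCells);
    std::iota(cells.begin(), cells.end(), std::size_t{0});

    double* f = lattice.f.data();
    const ModelTag* tags = lattice.tags.data();

    std::for_each(std::execution::par_unseq, cells.begin(), cells.end(),
                  [f, tags, omega](std::size_t c) {
                      // Tag-based dispatch: no virtual function call per cell.
                      switch (tags[c]) {
                          case ModelTag::Relax:
                              relaxCell(f + c * Q, omega);
                              break;
                          case ModelTag::Frozen:
                              break;  // leave the cell untouched
                      }
                  });
}
```

Calling `collide(lattice, omega)` updates only the cells tagged `ModelTag::Relax`. The lambda captures raw pointers by value and touches no shared mutable state, so each cell can be processed independently. Whether such a call actually runs on a GPU depends on the compiler and its offloading options, which this sketch does not assume.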

Paper Structure

This paper contains 20 sections, 5 equations, 13 figures, 1 table.

Figures (13)

  • Figure 1: The grid covering the simulation domain is partitioned into regular blocks. On the CPU, each block is processed sequentially by its assigned CPU core. On the GPU, each block is processed in parallel through the shared-memory formalism of parallel algorithms.
  • Figure 2: Array-of-Structure versus Structure-of-Array data layout in an example with four data elements per grid node.
  • Figure 3: A succession of polymorphic objects in the object-oriented structure translates to a unique integer identifier for the data-oriented data structure, exploiting Palabos's built-in unique string identification of numerical model classes.
  • Figure 4: Illustration of the framework developed to (1) create accelerated data structures (AcceleratedLattice) from Palabos's original object-oriented data structures (MultiBlockLattice), and (2) use both of them in a hybrid fashion.
  • Figure 5: Taylor-Green vortex simulations for $\mathrm{Re}=1600$, $\mathrm{Ma}=0.2$, various mesh resolutions $L\in\{128,256,512\}$, and most collision models implemented in Palabos. Time evolution of the normalized kinetic energy $k/k_0$ (top) and enstrophy $\epsilon/\epsilon_0$ (bottom). The subscript $0$ stands for quantities at $t=0$, and the convective time is defined as $t_c=2\pi L/ u_{\infty}$.
  • ...and 8 more figures
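
The two data layouts named in the Figure 2 caption can be sketched as index formulas over one flat array. This is an illustration under stated assumptions, not the Palabos implementation: the function names `aosIndex`, `soaIndex`, and `aosToSoa` are hypothetical, and the example in the usage note uses four data elements per grid node to match the caption.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Array-of-Structure: all elements of one node are contiguous in memory.
constexpr std::size_t aosIndex(std::size_t node, std::size_t elem,
                               std::size_t elemsPerNode) {
    return node * elemsPerNode + elem;
}

// Structure-of-Array: the same element of all nodes is contiguous in memory.
constexpr std::size_t soaIndex(std::size_t node, std::size_t elem,
                               std::size_t numNodes) {
    return elem * numNodes + node;
}

// Copy a flat AoS array into a new flat SoA array (hypothetical helper).
std::vector<double> aosToSoa(const std::vector<double>& aos,
                             std::size_t numNodes, std::size_t elemsPerNode) {
    assert(aos.size() == numNodes * elemsPerNode);
    std::vector<double> soa(aos.size());
    for (std::size_t node = 0; node < numNodes; ++node) {
        for (std::size_t elem = 0; elem < elemsPerNode; ++elem) {
            soa[soaIndex(node, elem, numNodes)] =
                aos[aosIndex(node, elem, elemsPerNode)];
        }
    }
    return soa;
}
```

For three nodes with four elements each, `aosToSoa` turns the node-major sequence `{0,1,2,3, 10,11,12,13, 20,21,22,23}` into the element-major sequence `{0,10,20, 1,11,21, 2,12,22, 3,13,23}`. In the second form, threads that handle neighboring nodes read neighboring addresses for a given element, which is the usual reason for preferring Structure-of-Array on GPUs.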