Multi-GPU Acceleration of PALABOS Fluid Solver using C++ Standard Parallelism
Jonas Latt, Christophe Coreixas
TL;DR
This work addresses accelerating Palabos, a lattice Boltzmann CFD library, on GPUs while preserving its CPU-oriented API. It introduces AcceleratedLattice, a data-oriented GPU container that complements the existing MultiBlockLattice. Together, the two containers enable hybrid CPU/GPU execution and incremental porting through a push-pull data flow and per-cell model tagging that avoids virtual function calls. The authors validate the GPU port on Taylor-Green vortex, lid-driven cavity, and Berea sandstone benchmarks, achieving near-peak single-GPU performance and good weak and strong scaling across multiple GPUs, with careful attention to memory bandwidth and CUDA-aware MPI halo exchanges. The approach emphasizes portability and ease of use: it relies solely on ISO C++ standard parallelism, keeps code duplication minimal, and lays the groundwork for further overlap of communication and computation and for expansion to additional accelerators.
Abstract
This article presents the principles, software architecture, and performance analysis of the GPU port of the lattice Boltzmann software library Palabos (J. Latt et al., "Palabos: Parallel lattice Boltzmann solver", Comput. Math. Appl. 81, 334-350 (2021)). A hybrid CPU-GPU execution model is adopted, in which numerical components are selectively assigned to either the CPU or the GPU, depending on considerations of performance or convenience. This design enables a progressive porting strategy, allowing most features of the original CPU-based codebase to be gradually and seamlessly adapted to GPU execution. The new architecture builds upon two complementary paradigms: a classical object-oriented structure for CPU execution, and a data-oriented counterpart for GPUs, which reproduces the modularity of the original code while eliminating the object-oriented overhead that is detrimental to GPU performance. Central to this approach is the use of modern C++, including standard parallel algorithms and template metaprogramming techniques, which permit the generation of hardware-agnostic computational kernels. This facilitates the development of user-defined, GPU-accelerated components such as collision operators or boundary conditions, while preserving compatibility with the existing codebase and avoiding the need for external libraries or non-standard language extensions. The correctness and performance of the GPU-enabled Palabos are demonstrated through a series of three-dimensional multiphysics benchmarks, including the laminar-turbulent transition in a Taylor-Green vortex, lid-driven cavity flow, and pore-scale flow in Berea sandstone. Despite the high-level abstraction of the implementation, the single-GPU performance is comparable to that of CUDA-native solvers, and multi-GPU tests exhibit good weak and strong scaling across all test cases.
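To make the data-oriented idea concrete, the following minimal sketch shows how a per-cell tag can replace a virtual function call inside a standard-algorithm loop. It is an illustration under stated assumptions and not the Palabos implementation: the names ModelTag, ToyLattice, relax, wall, and collide are hypothetical, and the one-value-per-cell "physics" is a toy stand-in for a real collision operator and boundary rule.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical per-cell model tags (illustrative names, not Palabos identifiers).
enum class ModelTag : std::uint8_t { Relax, Wall };

// Toy stand-in for a collision operator: relax toward the value 1.0.
inline double relax(double f, double omega) { return f + omega * (1.0 - f); }

// Toy stand-in for a wall rule: a sign flip in place of bounce-back.
inline double wall(double f) { return -f; }

// Data-oriented container: plain arrays holding one tag and one value per cell.
struct ToyLattice {
    std::vector<ModelTag> tag;
    std::vector<double> f;
};

// Apply the model selected by each cell's tag. The switch is resolved on plain
// data, so no virtual function table is consulted inside the loop body.
void collide(ToyLattice& lat, double omega) {
    assert(lat.tag.size() == lat.f.size());

    // Index range to iterate over; each iteration touches only its own cell.
    std::vector<std::size_t> idx(lat.f.size());
    std::iota(idx.begin(), idx.end(), std::size_t{0});

    // Raw pointers are captured by value so the lambda holds no references
    // to host-side objects.
    ModelTag const* tag = lat.tag.data();
    double* f = lat.f.data();

    // Sequential here for portability of the sketch. Because iterations are
    // independent, an execution policy such as std::execution::par_unseq
    // (from <execution>) could be passed as the first argument.
    std::for_each(idx.begin(), idx.end(), [=](std::size_t i) {
        switch (tag[i]) {
            case ModelTag::Relax: f[i] = relax(f[i], omega); break;
            case ModelTag::Wall:  f[i] = wall(f[i]);         break;
        }
    });
}
```

Calling collide on a two-cell ToyLattice with tags {Relax, Wall}, values {0.0, 2.0}, and omega = 0.5 leaves the values at {0.5, -2.0}. The point of the design is that adding a model means adding a tag and a switch case rather than a new class in a virtual hierarchy, which keeps the loop body free of indirect calls.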
