Architecture Specific Generation of Large Scale Lattice Boltzmann Methods for Sparse Complex Geometries

Philipp Suffa, Markus Holzer, Harald Köstler, Ulrich Rüde

TL;DR

The paper tackles the high memory and performance cost of LBM on sparse geometries by introducing a sparse, indirect-addressing data structure (a PDF-list and an index-list) integrated with the lbmpy code-generation pipeline and the waLBerla framework. It shows that generated sparse LBM kernels for CPUs and GPUs, covering a variety of stencils and collision models and combined with optimizations such as in-place AA streaming and communication hiding, yield substantial memory savings and scale well on HPC systems, with a scaling efficiency of at least 82% on up to 1024 NVIDIA A100 GPUs and up to 4096 AMD MI250X GPUs. The authors validate the approach with three realistic applications (porous media flow, flow over a particle bed, and coronary artery flow), showing speed-ups of up to ~11× in kernel-only runs and memory reductions of up to ~75%, while also addressing load-balancing challenges in heterogeneous blocks and at scale. The work presents a practical path to large-scale, sparse-domain CFD simulations on modern HPC architectures and highlights future opportunities in broader hardware support and architecture-specific optimizations.

Abstract

We implement and analyse a sparse / indirect-addressing data structure for the Lattice Boltzmann Method to support efficient compute kernels for fluid dynamics problems with a high number of non-fluid nodes in the domain, such as in porous media flows. The data structure is integrated into a code generation pipeline to enable sparse Lattice Boltzmann Methods with a variety of stencils and collision operators and to generate efficient code for kernels for CPU as well as for AMD and NVIDIA accelerator cards. We optimize these sparse kernels with an in-place streaming pattern to save memory accesses and memory consumption, and we implement a communication hiding technique to prove scalability. We present single-GPU performance results with up to 99% of maximal bandwidth utilization. We integrate the optimized generated kernels in the high-performance framework waLBerla and achieve a scaling efficiency of at least 82% on up to 1024 NVIDIA A100 GPUs and up to 4096 AMD MI250X GPUs on modern HPC systems. Further, we set up three different applications to test the sparse data structure for realistic demonstrator problems. We show performance results for flow through porous media, free flow over a particle bed, and blood flow in a coronary artery. We achieve a maximal performance speed-up of 2 and a significantly reduced memory consumption by up to 75% with the sparse / indirect-addressing data structure compared to the direct-addressing data structure for these applications.
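
To make the idea of a sparse / indirect-addressing layout concrete, the following is a minimal sketch in plain Python, not the generated lbmpy/waLBerla code. It uses a D2Q5 velocity set, numbers only the fluid cells, and builds an index-list that tells a pull-streaming kernel where in the PDF-list each incoming value is read from. The direction ordering, the structure-of-arrays layout `q * n + cell`, the function name `build_sparse_lists`, and the choice to encode no-slip walls as bounce-back directly in the index-list are all assumptions made for illustration.

```python
# D2Q5 velocity set: center, north, south, west, east (ordering is an assumption).
DIRECTIONS = [(0, 0), (0, 1), (0, -1), (-1, 0), (1, 0)]
INVERSE = [0, 2, 1, 4, 3]  # index of the opposite direction


def build_sparse_lists(fluid_mask):
    """Build a fluid-cell numbering and a pull index-list.

    fluid_mask[y][x] is True for fluid cells; everything else (including
    cells outside the mask) is treated as a no-slip obstacle.
    The PDF-list is assumed to be laid out as pdf[q * n + cell].
    """
    ny, nx = len(fluid_mask), len(fluid_mask[0])
    cell_ids = {}
    for y in range(ny):
        for x in range(nx):
            if fluid_mask[y][x]:
                cell_ids[(x, y)] = len(cell_ids)
    n = len(cell_ids)

    # One entry per non-center direction and fluid cell: (q - 1) * n + cell.
    index_list = []
    for q in range(1, len(DIRECTIONS)):
        cx, cy = DIRECTIONS[q]
        for (x, y), cell in cell_ids.items():
            source = (x - cx, y - cy)  # pull: the value arrives from x - c_q
            if source in cell_ids:
                index_list.append(q * n + cell_ids[source])
            else:
                # Bounce-back: read the cell's own PDF of the opposite direction.
                index_list.append(INVERSE[q] * n + cell)
    return cell_ids, index_list


# A 3x3 block where only the middle row is fluid (a small channel).
mask = [
    [False, False, False],
    [True, True, True],
    [False, False, False],
]
cell_ids, index_list = build_sparse_lists(mask)
print(len(cell_ids), index_list)
```

In this sketch only 3 of the 9 cells receive PDF storage, at the price of 4 extra index entries per fluid cell; this is the trade-off that makes the sparse layout attractive when most of the domain is non-fluid.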

Paper Structure

This paper contains 20 sections, 13 equations, 17 figures, 3 tables, and 1 algorithm.

Figures (17)

  • Figure 1: Exemplary setup of a sparse simulation domain in 2D with a low percentage of fluid covering the domain (light blue), and a high number of obstacle cells. Visualisation of the block partitioning with extraction of blocks without fluid. Illustration of a dense and a sparse data structure for an exemplary setup of 5x5 cells per block and a D2Q5 stencil. While the dense data structure stores PDFs and operates on all cells, the sparse data structure only stores and operates on fluid cells.
  • Figure 2: Structure of the PDF-list and the index-list for an exemplary D2Q5 velocity set. The domain contains fluid cells (white), ghost layers (light yellow), velocity-bounce-back (UBB) boundaries (light blue) and no-slip boundaries (grey). The directions of the PDF stencil are indicated by colors as well. To the west, an MPI interface to the neighboring block is considered. The cells to the north, east and south of the presented cells are also treated as no-slip cells.
  • Figure 3: Complete workflow of the code generation pipeline of lbmpy. For a full CFD application compute kernels as well as boundary and communication kernels are generated.
  • Figure 4: Single-GPU benchmark for sparse, dense and hybrid data structures with varying porosity on an NVIDIA A100 with $256^3$ cells, D3Q19 velocity set and SRT collision operator. Because LBM code is usually memory bound, the theoretical performance is calculated from the bandwidth of a streaming benchmark (1361 GB/s) and the theoretical number of memory accesses of the kernels.
  • Figure 5: Memory consumption benchmark for $256^3$ cells on a single NVIDIA A100 GPU for D3Q19 stencil and pull streaming pattern. The theoretical memory consumption is calculated in Eq. \ref{equ:mem}.
  • ...and 12 more figures
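
The caption of Figure 4 derives a theoretical performance from a measured streaming bandwidth and the number of memory accesses per cell update. A rough sketch of that kind of bandwidth-bound estimate is given below. The 1361 GB/s figure and the D3Q19 stencil come from the caption; the double-precision PDFs (8 bytes), the 32-bit index entries (4 bytes), the count of one index read per non-center direction, and both function names are assumptions for illustration and may differ from the access counts used in the paper.

```python
def bytes_per_cell_update(q, pdf_bytes=8, index_bytes=0):
    """Memory traffic per cell update for a memory-bound LBM kernel.

    Assumes every one of the q PDFs is read once and written once, plus
    (for indirect addressing) one index read per non-center direction.
    """
    pdf_traffic = 2 * q * pdf_bytes
    index_traffic = (q - 1) * index_bytes
    return pdf_traffic + index_traffic


def max_mlups(bandwidth_gb_per_s, bytes_per_update):
    """Upper bound in million lattice updates per second (MLUPS)."""
    return bandwidth_gb_per_s * 1e9 / bytes_per_update / 1e6


BANDWIDTH = 1361.0  # GB/s, streaming benchmark value quoted in Figure 4

dense = max_mlups(BANDWIDTH, bytes_per_cell_update(19))
sparse = max_mlups(BANDWIDTH, bytes_per_cell_update(19, index_bytes=4))
print(f"dense bound:  {dense:.0f} MLUPS per updated cell")
print(f"sparse bound: {sparse:.0f} MLUPS per updated fluid cell")
```

Under these assumptions the sparse kernel has a lower per-cell bound because of the extra index reads, but it only updates fluid cells, so its effective throughput on the whole domain improves as porosity decreases.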