Efficiency and scalability of fully-resolved fluid-particle simulations on heterogeneous CPU-GPU architectures

Samuel Kemmler; Christoph Rettinger; Ulrich Rüde; Pablo Cuéllar; Harald Köstler

TL;DR

This paper addresses the challenge of efficiently simulating fully resolved fluid-particle systems on heterogeneous CPU-GPU architectures by splitting the computation: the Eulerian fluid runs on GPUs using the lattice Boltzmann method, and the Lagrangian particles run on CPUs using the discrete element method. It presents a four-way coupled hybrid implementation with a fluid-particle interaction based on the partially saturated cells method, lubrication corrections, and sub-block particle mappings that reduce communication and improve locality. The implementation is assessed with a roofline analysis and with weak and strong scaling studies on up to 1024 A100 GPUs. The study shows that the CPU-GPU coupling overhead can be negligible and reports a parallel efficiency of up to 71% in dilute scenarios, while identifying particle synchronization and communication as the main bottlenecks. The work postulates four criteria for efficient hybrid implementations and suggests an a priori speedup estimate, indicating that the approach is viable for large-scale sediment transport and related multiphysics applications on modern heterogeneous HPC systems.

Abstract

Current supercomputers often have a heterogeneous architecture using both CPUs and GPUs. At the same time, numerical simulation tasks frequently involve multiphysics scenarios whose components run on different hardware for various reasons, such as architectural requirements or pragmatism. This leads naturally to a software design where different simulation modules are mapped to different subsystems of the heterogeneous architecture. We present a detailed performance analysis for such a hybrid four-way coupled simulation of a fully resolved particle-laden flow. The Eulerian representation of the flow utilizes GPUs, while the Lagrangian model for the particles runs on CPUs. First, a roofline model is employed to predict the node-level performance and to show that the lattice-Boltzmann-based fluid simulation reaches very good performance on a single GPU. Furthermore, the GPU-GPU communication for a large-scale flow simulation results in only moderate slowdowns due to the efficiency of the CUDA-aware MPI communication, combined with communication hiding techniques. On 1024 A100 GPUs, a parallel efficiency of up to 71% is achieved. While the flow simulation has good performance characteristics, the integration of the stiff Lagrangian particle system requires frequent CPU-CPU communication that can become a bottleneck. Additionally, special attention is paid to the CPU-GPU communication overhead since this communication is essential for coupling the particles to the flow simulation. However, thanks to our problem-aware co-partitioning, the CPU-GPU communication overhead is found to be negligible. As a lesson learned from this development, four criteria are postulated that a hybrid implementation must meet for the efficient use of heterogeneous supercomputers. Additionally, an a priori estimate of the speedup for hybrid implementations is suggested.
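The two performance measures named in the abstract, a roofline prediction and a parallel efficiency, can be illustrated with a minimal sketch. It assumes a purely memory-bound lattice Boltzmann kernel that reads and writes every population once per cell update. The D3Q19 stencil, double precision, and the bandwidth of 1555 GB/s are illustrative assumptions, not values taken from the paper, and the function names are hypothetical. Additional memory traffic from the particle coupling is ignored.

```python
def bytes_per_cell_update(q: int, bytes_per_value: int = 8) -> int:
    """Memory traffic of one cell update: q populations read, q written.

    Assumption: no other fields (e.g. coupling data) are touched.
    """
    return 2 * q * bytes_per_value


def roofline_mlups(bandwidth_gb_s: float, bytes_per_cell: float) -> float:
    """Roofline bound in million lattice updates per second (MLUPS)
    for a kernel limited by memory bandwidth only."""
    return bandwidth_gb_s * 1e9 / bytes_per_cell / 1e6


def weak_scaling_efficiency(per_gpu_single: float, per_gpu_parallel: float) -> float:
    """Parallel efficiency in weak scaling: per-GPU performance of the
    parallel run relative to the single-GPU run."""
    return per_gpu_parallel / per_gpu_single


if __name__ == "__main__":
    traffic = bytes_per_cell_update(q=19)           # assumed D3Q19, double precision
    bound = roofline_mlups(1555.0, traffic)         # assumed bandwidth in GB/s
    print(f"{traffic} bytes per update, bound {bound:.0f} MLUPS")
    # Hypothetical per-GPU rates chosen to reproduce an efficiency of 71%.
    print(f"efficiency: {weak_scaling_efficiency(100.0, 71.0):.2f}")
```

A measured single-GPU rate close to such a bound would indicate that the kernel saturates the memory interface, which is the kind of argument a roofline model supports.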

Paper Structure (24 sections, 29 equations, 14 figures)

Introduction
Numerical methods
Lattice Boltzmann method
Particle dynamics
Particle interactions using the discrete element method
Integration of the particle properties
Fully resolved fluid-particle coupling method
Lubrication correction
Particle mapping
Implementation
Fluid dynamics and coupling on the GPU
Coupling from the particles to the fluid
Fluid simulation
Coupling from the fluid to the particles
Particle dynamics on the CPU
...and 9 more sections

Figures (14)

Figure 1: Two-dimensional sketch of coupled fluid-particle simulations using the
Figure 2: The linear approximation yields the analytical solution for the blue cells. The particle is represented by the orange circle. Note that the grid is coarsened for better clarity.
Figure 3: Partitioning of a 2D simulation domain into four blocks. The circles with ID 0 to ID 9 indicate the particles, and the blue cells are the fluid. One GPU for updating the fluid cells is assigned to each block, and the corresponding CPU cores are responsible for the particle dynamics. These are the cores having a direct connection/affinity to that GPU. One MPI rank is assigned to each block's CPU cores, distributes the particle computations among them using OpenMP, and uses the block's GPU for the fluid dynamics.
Figure 4: Flowchart of our hybrid CPU-GPU implementation from the perspective of a CPU and a GPU responsible for the same block. The color coding indicates the communication types required within each step.
Figure 5: Overview of the different communication steps from the perspective of a CPU and a GPU responsible for the same block
...and 9 more figures
