A GPU-ready pseudo-spectral method for direct numerical simulations of multiphase turbulence

Alessio Roccon

TL;DR

This work addresses the computational challenge of directly simulating interface-resolved multiphase turbulence by porting a pseudo-spectral DNS solver to GPUs. The authors combine a Navier–Stokes solver with a phase-field (Cahn–Hilliard) model using a two-tier parallelization: MPI-based 2D pencil domain decomposition and OpenACC offloading with cuFFT for accelerators, facilitated by CUDA unified memory and CUDA-aware MPI. Key contributions include a portable, GPU-ready implementation with batched transforms, kernel-based nonlinear term evaluation, and efficient wall-normal solves, validated by strong scaling on HPC systems and a large-scale demo (2048×1024×1025 grid with 256 droplets). The approach enables high-fidelity multiphase turbulence simulations on heterogeneous hardware and sets the stage for extending to non-NVIDIA architectures (e.g., ROCm) in future work.

Abstract

In this work, we detail the GPU porting of an in-house pseudo-spectral solver tailored to large-scale, interface-resolved simulations of drop- and bubble-laden turbulent flows. The code relies on direct numerical simulation of the Navier-Stokes equations, used to describe the flow field, coupled with a phase-field method, used to describe the shape, deformation, and topological changes of the interface of the drops or bubbles. The governing equations (Navier-Stokes and Cahn-Hilliard) are solved using a pseudo-spectral method that relies on transforming the variables into wavenumber space. The code relies on multilevel parallelism. The first level of parallelism relies on the message-passing interface (MPI) and is used on multi-core architectures in CPU-based infrastructures. A second level of parallelism relies on OpenACC directives and the cuFFT library and is used to accelerate code execution when GPU-based infrastructures are targeted. The resulting multiphase flow solver can be efficiently executed on heterogeneous computing infrastructures and exhibits a remarkable speed-up when GPUs are employed. Thanks to the modular structure of the code and the use of a directive-based strategy to offload code execution to GPUs, only minor code modifications are required when targeting different computing architectures. This eases code maintenance, version control, and the implementation of additional modules or governing equations.
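To make the phrase "transforming the variables into wavenumber space" concrete, the following is a minimal sketch of the pseudo-spectral idea along a single periodic direction: derivatives are computed in wavenumber space (multiplication by $ik$), while the nonlinear product is formed pointwise in physical space. It is not taken from the solver described here. The function names, the naive $O(N^2)$ transform (a stand-in for an FFT library such as cuFFT), and the $2\pi$-periodic test function are all assumptions made for illustration.

```python
import cmath
import math


def dft(values, inverse=False):
    """Naive O(N^2) discrete Fourier transform (illustrative stand-in for an FFT).

    The forward transform is unnormalized; the inverse is scaled by 1/N.
    """
    n = len(values)
    sign = 1.0 if inverse else -1.0
    out = []
    for k in range(n):
        acc = 0j
        for j, v in enumerate(values):
            acc += v * cmath.exp(sign * 2j * math.pi * j * k / n)
        out.append(acc / n if inverse else acc)
    return out


def wavenumbers(n):
    """Integer wavenumbers in DFT ordering for a 2*pi-periodic domain."""
    return [k if k <= (n - 1) // 2 else k - n for k in range(n)]


def spectral_derivative(u):
    """Differentiate periodic samples u by multiplying each mode by i*k."""
    n = len(u)
    u_hat = dft(u)
    du_hat = [1j * k * c for k, c in zip(wavenumbers(n), u_hat)]
    if n % 2 == 0:
        # The Nyquist mode has no well-defined odd derivative; drop it.
        du_hat[n // 2] = 0j
    return [c.real for c in dft(du_hat, inverse=True)]


# Hypothetical test case: u(x) = sin(x) on a uniform periodic grid.
n = 16
x = [2.0 * math.pi * j / n for j in range(n)]
u = [math.sin(xj) for xj in x]

dudx = spectral_derivative(u)                  # derivative via wavenumber space
nonlinear = [a * b for a, b in zip(u, dudx)]   # product formed in physical space

max_err = max(abs(d - math.cos(xj)) for d, xj in zip(dudx, x))
```

For this smooth test function, `max_err` is at the level of floating-point round-off. In a three-dimensional solver the same forward-transform, multiply, inverse-transform pattern would be applied direction by direction, which is why the data layout has to change between transforms.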

Paper Structure (15 sections, 17 equations, 5 figures, 1 table)

Figures (5)

  • Figure 1: 2D domain decomposition employed for the first level of parallelization (MPI). Each color corresponds to a different MPI task. In physical space, the domain is divided along the $y$ and $z$ directions (pencils oriented along the $x$-direction), while in spectral space it is divided along the $x$ and $y$ directions (pencils oriented along the $z$-direction). Transpositions (i.e. loops of MPI communications) are required to change the pencil orientation and to compute the transforms along the various directions: along $x$ in the first configuration shown in panel $a$, along $y$ in the configuration shown in panel $b$, and along $z$ in the configuration shown in panel $c$.
  • Figure 2: Schematic showing the steps required to perform an MPI communication between data stored in the buffers of GPU 0 and GPU 1. In standard MPI installations, the data computed on GPU 0 has to be moved into host memory of CPU 0 and then sent to the MPI process 0 and, in turn, transferred to the GPU 1. Using a CUDA-aware MPI implementation that exploits GPUDirect technologies, e.g. Remote Direct Memory Access (RDMA), a direct transfer between GPU 0 and GPU 1 can be performed (long red arrow). Reproduced from cudawmpi.
  • Figure 3: Time elapsed per time step on different machines using all the physical cores available (blue) and all the GPUs available (green) on a single node. The results have been obtained for a single-phase turbulent channel flow at a grid resolution of $N_x \times N_y \times N_z = 256 \times 256 \times 257$. In all cases, the code has been compiled with the NVIDIA Fortran compiler nvfortran, with or without support for GPU acceleration depending on the case considered (CPU or GPU). Where nvfortran is not available (LUMI-C), the code has been compiled with gfortran.
  • Figure 4: Strong scaling results for the code FLOW36 obtained on Marconi100. Two different problem sizes have been considered: $N_x \times N_y \times N_z = 512 \times 512 \times 513$ and $N_x \times N_y \times N_z = 1024 \times 1024 \times 1025$. For the smaller problem size (blue dots), tests have been performed from 1 node (4 GPUs) up to 32 nodes (128 GPUs), while for the larger problem size (red dots) they have been performed from 8 nodes (32 GPUs) up to 64 nodes (256 GPUs).
  • Figure 5: Top view of a swarm of large and deformable drops released in a turbulent channel flow. The flow moves from left to right (along the streamwise direction) and drops coalesce and break under the action of turbulence fluctuations. The interface of the drops is identified as the iso-contour $\phi=0$. The grid resolution employed for this demo simulation is $N_x \times N_y \times N_z = 2048 \times 1024 \times 1025$. This run has been executed using 64 nodes (256 GPUs) of Leonardo turisini2023leonardo.
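
The 2D pencil decomposition described in the Figure 1 caption can be sketched as a small index calculation: given a global grid and a 2D grid of MPI tasks, each task owns the full extent of one direction and a block of the other two. This is a hypothetical illustration, not the solver's actual routine. The function names, the row-major mapping of ranks onto the task grid, the choice of which task-grid dimension splits which direction, and the layout of the intermediate ($y$-oriented) configuration are all assumptions; the sketch also ignores the reduced number of modes a real-to-complex transform would produce.

```python
def split_range(n, parts, index):
    """Return (start, stop) of block `index` when n points are split into
    `parts` contiguous blocks whose sizes differ by at most one point."""
    base, extra = divmod(n, parts)
    start = index * base + min(index, extra)
    stop = start + base + (1 if index < extra else 0)
    return start, stop


def pencil_extents(shape, grid, rank, orientation):
    """Index ranges (x, y, z) owned by `rank` for a given pencil orientation.

    shape: global grid (nx, ny, nz); grid: task grid (p_row, p_col).
    Ranks are assumed to be laid out row-major on the task grid.
    """
    nx, ny, nz = shape
    p_row, p_col = grid
    row, col = divmod(rank, p_col)
    if orientation == "x":  # physical space: full x, split along y and z
        return (0, nx), split_range(ny, p_row, row), split_range(nz, p_col, col)
    if orientation == "y":  # intermediate step (assumed): full y, split x and z
        return split_range(nx, p_row, row), (0, ny), split_range(nz, p_col, col)
    if orientation == "z":  # spectral space: full z, split along x and y
        return split_range(nx, p_row, row), split_range(ny, p_col, col), (0, nz)
    raise ValueError(f"unknown orientation: {orientation!r}")


# Example: the 256 x 256 x 257 grid of Figure 3 on an assumed 2 x 2 task grid.
shape = (256, 256, 257)
grid = (2, 2)
for rank in range(grid[0] * grid[1]):
    print(rank, pencil_extents(shape, grid, rank, "x"),
          pencil_extents(shape, grid, rank, "z"))
```

Changing from one orientation to the next means every task must exchange blocks with the other tasks in its row or column of the task grid, which is the transposition step the caption refers to.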