Experience and Analysis of Scalable High-Fidelity Computational Fluid Dynamics on Modular Supercomputing Architectures

Martin Karp, Estela Suarez, Jan H. Meinke, Måns I. Andersson, Philipp Schlatter, Stefano Markidis, Niclas Jansson

TL;DR

The paper investigates scalable, high-fidelity CFD using the spectral element method on Modular Supercomputing Architecture (MSA) by partitioning the domain across Booster (GPU) and Cluster (CPU) modules. It introduces a lightweight performance model to predict when cross-module execution reduces time-to-solution and validates it through three flow cases on multiple European HPC systems. Key findings show GPUs outperform CPUs for Neko, but cross-module benefits depend on problem size, memory capacity, and I/O load, with notable speedups when GPU memory limits force distribution across modules. The work provides practical guidance for when and how to spread large CFD workloads across heterogeneous modules and highlights implications for future accelerator-centric exascale systems. Overall, the study supports the trend toward GPU-dominated computation in large-scale CFD while acknowledging the continued relevance of memory and I/O considerations in multi-module environments.

Abstract

The never-ending computational demand from simulations of turbulence makes computational fluid dynamics (CFD) a prime application use case for current and future exascale systems. High-order finite element methods, such as the spectral element method, have been gaining traction as they offer high performance on both multicore CPUs and modern GPU-based accelerators. In this work, we assess how high-fidelity CFD using the spectral element method can exploit the modular supercomputing architecture at scale through domain partitioning, where the computational domain is split between a Booster module powered by GPUs and a Cluster module with conventional CPU nodes. We investigate several different flow cases and computer systems based on the modular supercomputing architecture (MSA). We observe that for our simulations, the communication overhead and load balancing issues incurred by incorporating different computing architectures are seldom worthwhile, especially when I/O is also considered, but when the simulation at hand requires more than the combined global memory on the GPUs, utilizing additional CPUs to increase the available memory can be fruitful. We support our results with a simple performance model to assess when running across modules might be beneficial. As MSA is becoming more widespread and efforts to increase system utilization are growing more important, our results give insight into when and how a monolithic application can utilize and spread out to more than one module and obtain a faster time to solution.
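To make the trade-off in the abstract concrete, the following is a minimal sketch of an idealized cross-module estimate. It is not the paper's model: it assumes each device sustains a fixed peak throughput (called `p_opt` here, in elements per second), that work is split in proportion to throughput, and that communication and I/O cost nothing. The `Module` class, its fields, and all numbers are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Module:
    """Hypothetical description of one MSA module (all fields are assumptions)."""
    name: str
    devices: int           # number of devices (GPUs or CPU nodes) used
    p_opt: float           # assumed peak throughput per device, elements/s
    mem_per_device: float  # assumed memory capacity per device, in elements


def aggregate_rate(modules):
    """Total throughput if every device runs at its assumed peak."""
    return sum(m.devices * m.p_opt for m in modules)


def ideal_time(n_elements, modules):
    """Lower bound on time per step: perfect balance, no communication cost."""
    return n_elements / aggregate_rate(modules)


def proportional_split(n_elements, modules):
    """Elements per module when work is divided in proportion to throughput."""
    rate = aggregate_rate(modules)
    return {m.name: n_elements * m.devices * m.p_opt / rate for m in modules}


def fits_in_memory(n_elements, modules):
    """True if the combined memory of the chosen modules can hold the problem."""
    return n_elements <= sum(m.devices * m.mem_per_device for m in modules)


# Illustrative numbers only: GPUs are 10x faster but hold fewer elements.
booster = Module("booster", devices=4, p_opt=10.0, mem_per_device=100.0)
cluster = Module("cluster", devices=4, p_opt=1.0, mem_per_device=400.0)

time_msa = ideal_time(440, [booster, cluster])
split_msa = proportional_split(440, [booster, cluster])
gpu_only_fits = fits_in_memory(500, [booster])
msa_fits = fits_in_memory(500, [booster, cluster])
```

With these made-up numbers the CPU module contributes less than a tenth of the aggregate throughput, so even a small communication penalty would erase the gain, while the memory check shows a problem that only fits once the CPU module is added. Both effects match the qualitative conclusions stated in the abstract.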

Paper Structure (19 sections, 10 equations, 5 figures, 3 tables)

This paper contains 19 sections, 10 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Visualizations of the three different cases, with red indicating high values and blue indicating low values. Left: the velocity magnitude in a cross-section of the pipe. Middle: the pressure field in the Taylor-Green vortex (TGV) case. Right: the temperature field in turbulent Rayleigh-Bénard convection.
  • Figure 2: Illustration of the performance model for two different computing devices $s_1, s_2$ with different performance characteristics. We denote the modeled time as described in Eq. \ref{eq:model} as Model $1~s_1:1~s_2$, and we model the best achievable performance based on $P_{opt}(s_1), P_{opt}(s_2)$ with a 1:1 mix of $s_1$ and $s_2$ devices. The strong scaling performance for $|S|$ computing devices, with a performance based on Figure \ref{fig:model_perf}, is shown in Figure \ref{fig:model_strong}.
  • Figure 3: Performance comparison between DEEP and JUWELS for our three different test cases. We show perfect linear scaling for the Booster and Cluster runs with a dotted line while we show the modeled performance with a green solid line without markers for the MSA runs. The modeled time is based on the highest performance $P_{opt}$ for the given case measured on the Booster and Cluster modules.
  • Figure 4: Performance comparison between the LUMI-G and JUWELS-Booster modules, where we compare utilizing the host for communication (host MPI) with utilizing device-aware MPI, where the host is only used to schedule kernels on the device.
  • Figure 5: Low-overhead performance traces from LLView for the GPU nodes (top) and CPU nodes (bottom) for an MSA run of the TGV case using 64 nodes split equally between GPU and CPU nodes (1:1 mix). The CPU and GPU usage metrics are defined as the percentage of time over the past sample period during which one or more kernels were executing on the GPU. Panel \ref{subfig:noio} shows a trace with no I/O, and panel \ref{subfig:io} shows a simulation with extensive I/O.
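
The "perfect linear scaling" reference mentioned in the Figure 3 caption is the standard strong-scaling ideal: for a fixed problem size, run time falls in inverse proportion to the device count. The sketch below computes that reference line and the corresponding parallel efficiency. The function names and the sample timings are hypothetical and are not measurements from the paper.

```python
def ideal_strong_scaling(t_ref, n_ref, device_counts):
    """Ideal run times for a fixed problem, given a reference time t_ref on n_ref devices."""
    return [t_ref * n_ref / n for n in device_counts]


def parallel_efficiency(t_ref, n_ref, t, n):
    """Ratio of ideal to measured time on n devices (1.0 means perfect scaling)."""
    return (t_ref * n_ref) / (t * n)


# Made-up example: 8.0 s per step on 4 devices as the reference point.
ideal = ideal_strong_scaling(8.0, 4, [4, 8, 16])
eff = parallel_efficiency(8.0, 4, t=5.0, n=8)
```

Here a hypothetical measured time of 5.0 s on 8 devices, against an ideal of 4.0 s, gives an efficiency of 0.8. Plotting measured times against the ideal list is one way to read how far a run departs from the dotted reference line.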