Experience and Analysis of Scalable High-Fidelity Computational Fluid Dynamics on Modular Supercomputing Architectures
Martin Karp, Estela Suarez, Jan H. Meinke, Måns I. Andersson, Philipp Schlatter, Stefano Markidis, Niclas Jansson
TL;DR
The paper investigates scalable, high-fidelity CFD using the spectral element method on the Modular Supercomputing Architecture (MSA), partitioning the computational domain between the Booster (GPU) and Cluster (CPU) modules. It introduces a lightweight performance model to predict when cross-module execution reduces time-to-solution and validates it with three flow cases on multiple European HPC systems. The key findings are that GPUs outperform CPUs for the Neko solver, and that the benefit of running across modules depends on problem size, memory capacity, and I/O load, with notable speedups when the problem exceeds GPU memory and must be distributed across modules. The work gives practical guidance on when and how to spread large CFD workloads across heterogeneous modules and discusses implications for future accelerator-centric exascale systems. Overall, the study supports the trend toward GPU-dominated computation in large-scale CFD, while memory and I/O considerations remain relevant in multi-module environments.
Abstract
The never-ending computational demand from simulations of turbulence makes computational fluid dynamics (CFD) a prime application use case for current and future exascale systems. High-order finite element methods, such as the spectral element method, have been gaining traction as they offer high performance on both multicore CPUs and modern GPU-based accelerators. In this work, we assess how high-fidelity CFD using the spectral element method can exploit the modular supercomputing architecture (MSA) at scale through domain partitioning, where the computational domain is split between a Booster module powered by GPUs and a Cluster module with conventional CPU nodes. We investigate several different flow cases and computer systems based on the MSA. We observe that, for our simulations, the communication overhead and load-balancing issues incurred by incorporating different computing architectures seldom pay off, especially when I/O is also considered. However, when the simulation at hand requires more than the combined global memory on the GPUs, utilizing additional CPUs to increase the available memory can be fruitful. We support our results with a simple performance model to assess when running across modules might be beneficial. As MSA is becoming more widespread and efforts to increase system utilization are growing more important, our results give insight into when and how a monolithic application can utilize and spread out to more than one module and obtain a faster time to solution.
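The abstract does not state the performance model itself, so the following is only an illustrative sketch of the kind of reasoning such a model might encode, not the authors' formulation. It assumes that work scales linearly with the number of elements, that each module has a fixed aggregate throughput and memory capacity, and that running across modules adds a constant per-step overhead. All names (`Module`, `booster_only_time`, `cross_module_time`, `worth_crossing`) and all numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Module:
    """Hypothetical description of one MSA module (illustrative fields only)."""
    devices: int          # number of GPUs (Booster) or CPU nodes (Cluster)
    rate: float           # elements processed per second per device (assumed constant)
    mem_capacity: float   # elements that fit in memory per device

    @property
    def total_rate(self) -> float:
        return self.devices * self.rate

    @property
    def total_capacity(self) -> float:
        return self.devices * self.mem_capacity


def booster_only_time(elements: float, booster: Module) -> Optional[float]:
    """Time per step on the Booster alone, or None if the case does not fit in memory."""
    if elements > booster.total_capacity:
        return None
    return elements / booster.total_rate


def cross_module_time(elements: float, booster: Module, cluster: Module,
                      overhead: float) -> Optional[Tuple[float, float]]:
    """Return (elements_on_booster, time_per_step) for a split run, or None if it cannot fit.

    The split starts from the load-balanced point (both modules finish together)
    and is then clamped so that neither module exceeds its memory capacity.
    `overhead` is an assumed constant cost per step for cross-module communication.
    """
    if elements > booster.total_capacity + cluster.total_capacity:
        return None
    balanced = elements * booster.total_rate / (booster.total_rate + cluster.total_rate)
    on_booster = min(balanced, booster.total_capacity)
    on_booster = max(on_booster, elements - cluster.total_capacity)
    on_cluster = elements - on_booster
    # The slower module determines the step time; overhead is added on top.
    step = max(on_booster / booster.total_rate, on_cluster / cluster.total_rate)
    return on_booster, step + overhead


def worth_crossing(elements: float, booster: Module, cluster: Module,
                   overhead: float) -> bool:
    """True if the split run is feasible and beats (or is the only alternative to) Booster-only."""
    split = cross_module_time(elements, booster, cluster, overhead)
    if split is None:
        return False
    alone = booster_only_time(elements, booster)
    return alone is None or split[1] < alone


if __name__ == "__main__":
    # Made-up numbers, chosen only to exercise the three regimes.
    booster = Module(devices=4, rate=100.0, mem_capacity=1000.0)
    cluster = Module(devices=2, rate=50.0, mem_capacity=4000.0)
    print(worth_crossing(2000.0, booster, cluster, overhead=0.5))  # small overhead
    print(worth_crossing(2000.0, booster, cluster, overhead=1.5))  # large overhead
    print(worth_crossing(6000.0, booster, cluster, overhead=0.5))  # exceeds GPU memory
```

Under these assumptions, the sketch reproduces the qualitative behaviour described in the abstract: when the case fits on the GPUs, adding the Cluster only helps if the overhead is small compared with the work it takes over, whereas a case that exceeds the combined GPU memory can only run at all by spilling onto CPU nodes. A realistic model would also need to account for I/O and for overhead that varies with the partition.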
