Benchmarking with Supernovae: A Performance Study of the FLASH Code
Joshua Martin, Catherine Feldman, Eva Siegmann, Tony Curtis, David Carlson, Firat Coskun, Daniel Wood, Raul Gonzalez, Robert J. Harrison, Alan C. Calder
TL;DR
The paper benchmarks FLASH on modern CPU architectures to quantify performance and energy efficiency for a large-scale 3D Type Ia supernova problem. It compares Intel Sapphire Rapids with HBM, AMD Milan, Intel Skylake, and Fujitsu A64FX-700 in a strong-scaling study of a 220 GB problem, and it analyzes memory options (HBM vs DDR5) and MPI mappings. The key finding is that Sapphire Rapids with HBM delivers the fastest runtimes and best energy efficiency, while A64FX-700 requires many more nodes and offers a weaker energy advantage; the benefit of HBM is limited for this compute-heavy code and depends on the MPI mapping. The work identifies MPI communication as a primary bottleneck and suggests vectorization and threading optimizations (FLASH-X) and exploration of future hardware (e.g., Fugaku) to improve performance portability for FLASH-like AMR codes.
Abstract
Astrophysical simulations are computation-, memory-, and thus energy-intensive, and therefore require new hardware advances for progress. Stony Brook University recently expanded its computing cluster "SeaWulf" with the addition of 94 new nodes featuring Intel Sapphire Rapids Xeon Max series CPUs. We present a performance and power efficiency study of this hardware performed with FLASH: a multi-scale, multi-physics, adaptive mesh-based software instrument. We extend this study to compare performance to that of Stony Brook's Ookami testbed, which features ARM-based A64FX-700 processors, and SeaWulf's AMD EPYC Milan and Intel Skylake nodes. Our application is a stellar explosion known as a thermonuclear (Type Ia) supernova, and for this 3D problem FLASH includes operators for hydrodynamics, gravity, and nuclear burning, in addition to routines for the material equation of state. We perform a strong-scaling study with a 220 GB problem size to explore both single- and multi-node performance. Our study explores the performance of different MPI mappings and the distribution of processors across nodes. From these tests, we determined the optimal configuration to balance runtime and energy consumption for our application.
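The quantities a strong-scaling study of this kind typically reports can be sketched as follows. This is a minimal illustration, not the authors' analysis scripts: the `Run` record, the `strong_scaling` helper, and all numbers are hypothetical placeholders, and energy is approximated as nodes × mean per-node power × runtime.

```python
from dataclasses import dataclass


@dataclass
class Run:
    nodes: int          # number of nodes used for the fixed-size problem
    runtime_s: float    # wall-clock time in seconds
    avg_power_w: float  # mean power draw per node in watts (assumed measured)


def strong_scaling(runs):
    """Return (nodes, speedup, efficiency, energy_kwh) for each run.

    Speedup and efficiency are relative to the run with the fewest nodes.
    """
    runs = sorted(runs, key=lambda r: r.nodes)
    base = runs[0]
    rows = []
    for r in runs:
        speedup = base.runtime_s / r.runtime_s
        # Ideal strong scaling gives efficiency 1.0 at every node count.
        efficiency = speedup * base.nodes / r.nodes
        # Joules = nodes * watts * seconds; 3.6e6 J per kWh.
        energy_kwh = r.nodes * r.avg_power_w * r.runtime_s / 3.6e6
        rows.append((r.nodes, speedup, efficiency, energy_kwh))
    return rows


# Placeholder numbers for illustration only, not measurements from the study.
example_runs = [
    Run(nodes=1, runtime_s=1000.0, avg_power_w=500.0),
    Run(nodes=2, runtime_s=500.0, avg_power_w=500.0),
    Run(nodes=4, runtime_s=400.0, avg_power_w=500.0),
]

for nodes, speedup, eff, kwh in strong_scaling(example_runs):
    print(f"{nodes:>3} nodes  speedup {speedup:5.2f}  "
          f"efficiency {eff:5.2f}  energy {kwh:6.3f} kWh")
```

In the placeholder data, doubling from one to two nodes halves the runtime at constant energy, while going to four nodes shortens the runtime further but costs more energy. That trade-off is what a balanced runtime and energy configuration has to weigh.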
