Benchmarking with Supernovae: A Performance Study of the FLASH Code

Joshua Martin, Catherine Feldman, Eva Siegmann, Tony Curtis, David Carlson, Firat Coskun, Daniel Wood, Raul Gonzalez, Robert J. Harrison, Alan C. Calder

TL;DR

The paper benchmarks FLASH on modern CPU architectures to quantify performance and energy efficiency for a large-scale 3D Type Ia supernova problem. It compares Intel Sapphire Rapids with HBM, AMD Milan, Intel Skylake, and Fujitsu A64FX-700 using strong scaling across a 220 GB workload and analyzes memory options (HBM vs DDR5) and MPI mappings. The key finding is that Sapphire Rapids with HBM delivers the fastest runtimes and best energy efficiency, while A64FX-700 requires many more nodes and offers a weaker energy advantage; HBM benefits are limited for this compute-heavy code and depend on MPI mapping. The work identifies MPI communication as a primary bottleneck and suggests vectorization and threading optimizations (FLASH-X) and exploration of future hardware (e.g., Fugaku) to improve performance portability for FLASH-like AMR codes.

Abstract

Astrophysical simulations are computation, memory, and thus energy intensive, thereby requiring new hardware advances for progress. Stony Brook University recently expanded its computing cluster "SeaWulf" with an addition of 94 new nodes featuring Intel Sapphire Rapids Xeon Max series CPUs. We present a performance and power efficiency study of this hardware performed with FLASH: a multi-scale, multi-physics, adaptive mesh-based software instrument. We extend this study to compare performance to that of Stony Brook's Ookami testbed which features ARM-based A64FX-700 processors, and SeaWulf's AMD EPYC Milan and Intel Skylake nodes. Our application is a stellar explosion known as a thermonuclear (Type Ia) supernova and for this 3D problem, FLASH includes operators for hydrodynamics, gravity, and nuclear burning, in addition to routines for the material equation of state. We perform a strong-scaling study with a 220 GB problem size to explore both single- and multi-node performance. Our study explores the performance of different MPI mappings and the distribution of processors across nodes. From these tests, we determined the optimal configuration to balance runtime and energy consumption for our application.
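As a rough illustration of how a strong-scaling study like the one described above is usually summarized, the sketch below computes speedup and parallel efficiency relative to the smallest core count, plus an energy-to-solution estimate from mean power and runtime. The function names, the choice of baseline, and all numbers are hypothetical assumptions for illustration; they are not taken from the paper and may differ from the authors' exact definitions.

```python
# Minimal sketch of strong-scaling bookkeeping (illustrative assumptions only).

def strong_scaling_metrics(cores, runtimes_s):
    """Return (cores, speedup, efficiency) relative to the first entry.

    In strong scaling the problem size is fixed, so the ideal runtime
    drops in proportion to the number of cores used.
    """
    base_cores, base_time = cores[0], runtimes_s[0]
    rows = []
    for p, t in zip(cores, runtimes_s):
        speedup = base_time / t                  # how much faster than baseline
        efficiency = speedup / (p / base_cores)  # 1.0 means ideal scaling
        rows.append((p, speedup, efficiency))
    return rows


def energy_to_solution_kj(mean_power_w, runtime_s):
    """Energy in kJ, assuming a constant mean power draw over the run."""
    return mean_power_w * runtime_s / 1000.0


# Hypothetical core counts and runtimes, NOT measurements from the paper.
example_cores = [4, 8, 16]
example_runtimes_s = [1000.0, 520.0, 280.0]

for p, s, e in strong_scaling_metrics(example_cores, example_runtimes_s):
    print(f"{p:4d} cores: speedup {s:.2f}, efficiency {e:.2f}")
```

Reporting efficiency alongside runtime makes it easier to see where adding cores stops paying off, which is the kind of trade-off between runtime and energy consumption the abstract refers to.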

Paper Structure (13 sections, 8 figures, 3 tables)

Figures (8)

  • Figure 1: Top panel: Strong-scaling studies on all configurations available on SeaWulf from 4 to 96 cores (single node). Missing data points are explained above. Bottom panel: The same strong-scaling study but expanded out to 384 cores and expressed as a log plot.
  • Figure 2: Per-cell per-timestep CPU time on all configurations available on SeaWulf from 4 to 384 cores. Missing data points are explained above.
  • Figure 3: Energy consumption for strong-scaling study on all configurations available on SeaWulf from 4 to 384 cores. Missing data points are explained above.
  • Figure 4: Top panel: Strong-scaling study on A64FX-700. Reported is the max evolution runtime. The two horizontal gray lines show the runtime for SPR+GCC with HBM for 192 (top, dashed line) and 384 (bottom, dotted line) cores, the points for which would be off this scale. Bottom panel: The maximum evolution per-cell per-timestep CPU time.
  • Figure 5: Energy consumption on A64FX-700. The two horizontal gray lines show the energy consumption for SPR+GCC with HBM for 192 (bottom, dashed line) and 384 (top, dotted line) cores, the points for which would be off this scale.
  • ...and 3 more figures