Performance Analysis of HPC applications on the Aurora Supercomputer: Exploring the Impact of HBM-Enabled Intel Xeon Max CPUs
Huda Ibeid, Vikram Narayana, Jeongnim Kim, Anthony Nguyen, Vitali Morozov, Ye Luo
TL;DR
The paper investigates memory system design for the Aurora exascale system, focusing on the trade-offs between HBM and DDR memory and on how memory modes (Flat vs. Cache) and clustering modes (Quad vs. SNC4) affect system and HPC application performance. It uses microbenchmarks (STREAM, Intel Memory Latency Checker, OSU MPI benchmarks) and application trials with HACC, QMCPACK, and BFS to quantify memory bandwidth, latency, CPU-GPU PCIe bandwidth, and MPI performance across configurations. Key findings show that HBM, particularly in SNC4-Flat mode, delivers higher memory bandwidth and lower latency, which benefits memory-bandwidth-bound workloads such as HACC, while Cache mode often helps latency-sensitive or larger-footprint workloads (QMCPACK, BFS) by leveraging DDR capacity. SNC4 requires careful process and NIC affinity to avoid imbalances. The work offers guidance for selecting memory configurations on Aurora-like systems, showing that the best choice depends on how a workload balances memory bandwidth, latency, and capacity, and that workload-aware tuning matters on exascale systems.
Abstract
The Aurora supercomputer is an exascale-class system designed to tackle some of the most demanding computational workloads. Equipped with both High Bandwidth Memory (HBM) and DDR memory, it provides unique trade-offs in performance, latency, and capacity. This paper presents a comprehensive analysis of the memory systems on the Aurora supercomputer, with a focus on evaluating the trade-offs between HBM and DDR memory systems. We explore how different memory configurations, including memory modes (Flat and Cache) and clustering modes (Quad and SNC4), influence key system performance metrics such as memory bandwidth, latency, CPU-GPU PCIe bandwidth, and MPI communication bandwidth. Additionally, we examine the performance of three representative HPC applications -- HACC, QMCPACK, and BFS -- each illustrating the impact of memory configurations on performance. By using microbenchmarks and application-level analysis, we provide insights into how to select the optimal memory system and configuration to maximize performance based on the application characteristics. The findings presented in this paper offer guidance for users of the Aurora system and similar exascale systems.
