EDAN: Towards Understanding Memory Parallelism and Latency Sensitivity in HPC

Siyuan Shen, Mikhail Khalilov, Lukas Gianinazzi, Timo Schneider, Marcin Chrapek, Jai Dayal, Manisha Gajbe, Robert Wisniewski, Torsten Hoefler

TL;DR

EDAN introduces an Execution DAG–based toolchain that converts runtime instruction traces into eDAGs to quantify memory latency sensitivity in HPC workloads. It develops a Brent-inspired memory cost model and derives two metrics, λ and Λ, to rank and compare memory-latency sensitivity across programs, while also estimating theoretical bandwidth via data movement on the eDAG. Validation against gem5 on PolyBench demonstrates reasonable alignment in latency-sensitivity rankings and a substantial productivity advantage, with HPCG and LULESH case studies illustrating cache and memory-depth interactions. The work enables architecture-aware programming and hardware design by providing a fast, scalable method to analyze memory-latency effects in HPC applications, guiding cache sizing, memory disaggregation, and parallelization strategies.

Abstract

Resource disaggregation is a promising technique for improving the efficiency of large-scale computing systems. However, this comes at the cost of increased memory access latency due to the need to rely on the network fabric to transfer data between remote nodes. As such, it is crucial to ascertain an application's memory latency sensitivity to minimize the overall performance impact. Existing tools for measuring memory latency sensitivity often rely on custom ad-hoc hardware or cycle-accurate simulators, which can be inflexible and time-consuming. To address this, we present EDAN (Execution DAG Analyzer), a novel performance analysis tool that leverages an application's runtime instruction trace to generate its corresponding execution DAG. This approach allows us to estimate the latency sensitivity of sequential programs and investigate the impact of different hardware configurations. EDAN not only provides us with the capability of calculating the theoretical bounds for performance metrics, but it also helps us gain insight into the memory-level parallelism inherent to HPC applications. We apply EDAN to applications and benchmarks such as PolyBench, HPCG, and LULESH to unveil the characteristics of their intrinsic memory-level parallelism and latency sensitivity.
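
To make the idea of an execution DAG concrete, here is a minimal sketch in Python. It is not EDAN's implementation, and it does not reproduce the paper's cost model or its λ and Λ metrics. The `Instr` record, the function names, and the cycle costs are assumptions made for illustration only. The sketch links each instruction to the earlier instructions that produced its inputs (true dependencies only), then computes total work and critical-path depth. From these, the classical Brent-style bounds max(W/p, D) ≤ T_p ≤ W/p + D give a rough range for the runtime on p parallel units.

```python
from collections import namedtuple

# Hypothetical trace record (not EDAN's trace format): the value written,
# the values read, and an assumed cost in cycles.
Instr = namedtuple("Instr", ["dst", "srcs", "cost"])

def build_edag(trace):
    """Return predecessor lists: edge i -> j when j reads a value last written by i.

    Only read-after-write dependencies are tracked; this is a simplification.
    """
    last_writer = {}
    preds = []
    for j, ins in enumerate(trace):
        preds.append(sorted({last_writer[s] for s in ins.srcs if s in last_writer}))
        last_writer[ins.dst] = j
    return preds

def work_and_depth(trace, preds):
    """Work is the sum of all costs; depth is the cost of the longest dependency chain."""
    work = sum(ins.cost for ins in trace)
    finish = []
    # Trace order is a valid topological order: predecessors always come earlier.
    for j, ins in enumerate(trace):
        start = max((finish[i] for i in preds[j]), default=0)
        finish.append(start + ins.cost)
    return work, max(finish, default=0)

def brent_bounds(work, depth, p):
    """Classical bounds on the runtime with p parallel units."""
    return max(work / p, depth), work / p + depth

def sum_of_three(mem_latency):
    """Toy trace for s = a + b + c: three independent loads, then two dependent adds."""
    return [
        Instr("a", (), mem_latency),
        Instr("b", (), mem_latency),
        Instr("c", (), mem_latency),
        Instr("t", ("a", "b"), 1),
        Instr("s", ("t", "c"), 1),
    ]

if __name__ == "__main__":
    for latency in (100, 200):  # assumed memory latencies in cycles
        trace = sum_of_three(latency)
        work, depth = work_and_depth(trace, build_edag(trace))
        print(latency, work, depth, brent_bounds(work, depth, p=2))
```

In this toy trace the three loads are independent, so raising the assumed memory latency by 100 cycles raises the depth by only 100 cycles while the work grows by 300. A dependent chain of loads, such as pointer chasing, would instead add the full latency to the depth at every step, which is the kind of difference a latency-sensitivity analysis aims to expose.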

Paper Structure

This paper contains 26 sections, 6 equations, 16 figures, 2 tables, 1 algorithm.

Figures (16)

  • Figure 1: Simulation time of PolyBench kernels (small size) using QEMU emulation with instruction tracing (EDAN) and gem5 cycle-approximate simulation. Runtime on a RISC-V chip is used as the baseline for slowdown calculations.
  • Figure 2: A simple C program calculating the sum of 3 variables and its corresponding eDAG.
  • Figure 3: High-level overview of the EDAN toolchain.
  • Figure 4: Kernel in C that sums all elements in an array.
  • Figure 5: Section of the trace from the summation kernel.
  • ...and 11 more figures