Heterogeneous Memory Pool Tuning

Filip Vaverka, Ondrej Vysocky, Lubomir Riha

TL;DR

This work addresses memory bandwidth bottlenecks on heterogeneous-memory platforms by introducing a lightweight tool that analyzes and controls allocation-level data placement between DDR and on-package HBM. It combines allocation instrumentation with performance counters to build a planning model and evaluates the approach on benchmarks such as the NAS Parallel Benchmarks (NPB) and k-Wave. The key finding is that, for these benchmarks, about 90% of the platform's potential performance is attainable with roughly 60–75% of the data in HBM and the remaining 25–40% in DDR. This gives developers and tuning tools a practical way to optimize data layout when HBM capacity is limited.

Abstract

We present a lightweight tool for the analysis and tuning of application data placement in systems with heterogeneous memory pools. The tool non-intrusively identifies, analyzes, and controls the placement of the application's individual allocations. We use the tool to analyze a set of benchmarks running on the Intel Sapphire Rapids platform with both HBM and DDR memory. The paper also analyzes the performance of both memory subsystems in terms of read/write bandwidth and latency. The key part of the analysis focuses on performance when both subsystems are used together. We show that only about 60% to 75% of the data must be placed in HBM to achieve 90% of the potential performance of the platform on those benchmarks.
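
To make the placement question concrete, the following is a minimal sketch and not the authors' tool. It assumes each allocation has already been profiled with a size and an estimate of the memory traffic it generates, and it greedily places the allocations with the most traffic per byte into HBM until a capacity budget is used up. The names `Allocation` and `plan_hbm_placement`, and all numbers in the profile, are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Allocation:
    name: str           # hypothetical label for an allocation site
    size_bytes: int     # size of the allocation (assumed > 0)
    traffic_bytes: int  # estimated bytes read/written over the run (assumed input)


def plan_hbm_placement(allocs, hbm_budget_bytes):
    """Greedy heuristic: prefer allocations with the most traffic per byte."""
    ranked = sorted(allocs, key=lambda a: a.traffic_bytes / a.size_bytes, reverse=True)
    in_hbm, used = [], 0
    for a in ranked:
        # Take the allocation only if it still fits in the remaining HBM budget.
        if used + a.size_bytes <= hbm_budget_bytes:
            in_hbm.append(a.name)
            used += a.size_bytes
    return in_hbm, used


# Hypothetical profile, for illustration only (not data from the paper).
GIB = 1 << 30
allocs = [
    Allocation("grid", 8 * GIB, 400 * GIB),
    Allocation("halo", 1 * GIB, 10 * GIB),
    Allocation("lookup", 4 * GIB, 20 * GIB),
    Allocation("scratch", 2 * GIB, 120 * GIB),
]
in_hbm, used = plan_hbm_placement(allocs, hbm_budget_bytes=11 * GIB)
hbm_fraction = used / sum(a.size_bytes for a in allocs)
```

The heuristic ignores latency sensitivity and the read/write mix of each allocation; a real planner would likely need to account for both, since the two memory pools differ in latency as well as bandwidth.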

Paper Structure

This paper contains 9 sections, 15 figures, 2 tables.

Figures (15)

  • Figure 1: NUMA architecture of dual Intel Xeon Max 9468 server in flat SNC4 mode.
  • Figure 2: Memory bandwidth measured with STREAM benchmark with all data in DDR or HBM memory.
  • Figure 3: The on-package HBM of Intel Xeon Max 9468 exhibits about 20% higher latency compared to the DDR memory.
  • Figure 4: Random memory access speedup for summation of randomly spaced values and random pointer chase in 32G array uniformly spread over all DDR or HBM memory nodes of a single socket. Speedup below one means DDR memory is faster than HBM.
  • Figure 5: Memory bandwidth of Copy and Add sub-tests of STREAM benchmark in relation to placement (DDR or HBM) of each work array (16G per array).
  • ...and 10 more figures
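
As a rough illustration of what the STREAM-based figures measure, the sketch below enumerates the DDR/HBM placements of the work arrays for the STREAM Copy (`c[i] = a[i]`) and Add (`c[i] = a[i] + b[i]`) kernels, and converts an elapsed time into bandwidth using STREAM's usual byte counting (two arrays of traffic for Copy, three for Add). The helper names and the elapsed time are hypothetical, and no measured values from the paper are reproduced.

```python
from itertools import product

POOLS = ("DDR", "HBM")
# Arrays touched by each kernel: Copy reads a, writes c; Add reads a and b, writes c.
KERNEL_ARRAYS = {"Copy": ("a", "c"), "Add": ("a", "b", "c")}


def placements(kernel):
    """All assignments of the kernel's work arrays to a memory pool."""
    arrays = KERNEL_ARRAYS[kernel]
    return [dict(zip(arrays, combo)) for combo in product(POOLS, repeat=len(arrays))]


def bandwidth_gb_s(kernel, array_bytes, seconds):
    """STREAM-style bandwidth: bytes moved across all arrays divided by time."""
    moved = len(KERNEL_ARRAYS[kernel]) * array_bytes
    return moved / seconds / 1e9


array_bytes = 16 * 10**9  # 16G per array, read here as decimal GB (assumption)
elapsed = 0.5             # placeholder time in seconds, not a measurement
copy_bw = bandwidth_gb_s("Copy", array_bytes, elapsed)
add_bw = bandwidth_gb_s("Add", array_bytes, elapsed)
```

Copy has four possible placements and Add has eight, which is why a per-array placement study grows quickly with the number of arrays an application allocates.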