Understanding Data Movement in Tightly Coupled Heterogeneous Systems: A Case Study with the Grace Hopper Superchip

Luigi Fusco, Mikhail Khalilov, Marcin Chrapek, Giridhar Chukkapalli, Thomas Schulthess, Torsten Hoefler

TL;DR

The paper studies data movement in tightly coupled heterogeneous systems through a microbenchmark study of the Quad GH200 node of the Alps supercomputer. Each node combines four GH200 Superchips, each pairing a Grace CPU with a Hopper GPU over the cache-coherent NVLink-C2C interconnect, and each GH200 reaches the Slingshot network through its own NIC. The authors develop a datapath-oriented benchmark suite that maps bandwidth and latency across local and peer memories, then validate the findings on real workloads, including GEMM, LLM inference, and NCCL, showing that data placement and the resulting datapath dominate performance in memory-bound scenarios. The C2C-backed unified memory and ATS-enabled memory system let CPUs and GPUs pool memory across the node, yet performance remains highly sensitive to where data is allocated and how it is accessed, with Hopper and Grace exhibiting distinct caching and ATS behaviors. The results inform memory-placement strategies and programming models for future tightly coupled heterogeneous systems, particularly for memory-bound HPC and ML workloads.

Abstract

Heterogeneous supercomputers have become the standard in HPC. GPUs in particular have dominated the accelerator landscape, offering unprecedented performance in parallel workloads and unlocking new possibilities in fields like AI and climate modeling. With many workloads becoming memory-bound, improving the communication latency and bandwidth within the system has become a main driver in the development of new architectures. The Grace Hopper Superchip (GH200) is a significant step in the direction of tightly coupled heterogeneous systems, in which all CPUs and GPUs share a unified address space and support transparent fine-grained access to all main memory on the system. We characterize both intra- and inter-node memory operations on the Quad GH200 nodes of the new Swiss National Supercomputing Centre Alps supercomputer, and show the importance of careful memory placement on example workloads, highlighting tradeoffs and opportunities.

Paper Structure

This paper contains 26 sections, 19 figures, and 3 tables.

Figures (19)

  • Figure 1: Architecture of the Quad GH200 node of the Alps supercomputer. Every node is composed of four GH200s, fully connected using NVLink and a cache-coherent interconnect. Every GH200 is connected to a Slingshot network through a separate NIC.
  • Figure 2: Maximum bandwidth plotted against access latency achieved by Grace (left) and Hopper (right) to different memories of the system. The suffix "-p" indicates memory on a peer GH200.
  • Figure 3: Theoretical bandwidth bound for (left to right) read and write operations, copy operations issued by a Grace, and copy operations issued by a Hopper. Bounds depend on the datapath, which in turn depends on the types of memories involved. The bounds are shown in GB/s and include the limiting interconnect.
  • Figure 4: Execution time of an artificial application using system-allocated memory and managed memory (lower is better).
  • Figure 5: Throughput (GB/s) achieved by cudaMemcpy on different combinations of source and destination memory types. Note that Device and HBM memory are physically co-located but are allocated using different APIs (cudaMalloc vs numa_alloc_onnode).
  • ...and 14 more figures
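
The caption of Figure 3 describes bandwidth bounds that depend on the datapath and on its limiting interconnect. One way to read this is that the bound for a transfer is the minimum bandwidth over the links it crosses. The sketch below illustrates that reading only; the node names, the link table, and every bandwidth value are hypothetical placeholders chosen for the example and are not taken from the paper.

```python
from typing import Dict, List, Tuple

# Hypothetical per-link peak bandwidths in GB/s. These are placeholder
# values for illustration only, NOT measurements or bounds from the paper.
LINK_BANDWIDTH_GBPS: Dict[Tuple[str, str], float] = {
    ("hopper", "hbm"): 4000.0,    # GPU to its local memory
    ("grace", "ddr"): 500.0,      # CPU to its local memory
    ("grace", "hopper"): 450.0,   # CPU-GPU link within one superchip
}


def link_bandwidth(a: str, b: str) -> float:
    """Look up a link in either direction; raises KeyError if it is unknown."""
    if (a, b) in LINK_BANDWIDTH_GBPS:
        return LINK_BANDWIDTH_GBPS[(a, b)]
    return LINK_BANDWIDTH_GBPS[(b, a)]


def datapath_bound(path: List[str]) -> Tuple[float, Tuple[str, str]]:
    """Return (bound in GB/s, limiting hop) for a path given as a list of nodes.

    The bound is the smallest link bandwidth along the path, i.e. the
    bottleneck hop.
    """
    if len(path) < 2:
        raise ValueError("a datapath needs at least two nodes")
    hops = list(zip(path, path[1:]))
    limiting = min(hops, key=lambda hop: link_bandwidth(*hop))
    return link_bandwidth(*limiting), limiting


if __name__ == "__main__":
    # A GPU reading CPU-attached memory crosses two links.
    print(datapath_bound(["hopper", "grace", "ddr"]))
    # A GPU reading its own memory crosses one link.
    print(datapath_bound(["hopper", "hbm"]))
```

With these placeholder values, the GPU-to-CPU-memory path is limited by the CPU-GPU link rather than by either memory, which is the kind of placement sensitivity the figure list points to. Real bounds should be taken from the paper's Figure 3, not from this table.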