Harnessing Integrated CPU-GPU System Memory for HPC: a first look into Grace Hopper

Gabin Schieffer, Jacob Wahlgren, Jie Ren, Jennifer Faj, Ivy Peng

TL;DR

This work addresses GPU memory capacity bottlenecks in HPC by evaluating the integrated CPU-GPU system memory of the Grace Hopper Superchip, a hardware-assisted approach to Unified Memory. It compares system-allocated memory against CUDA managed memory across six HPC workloads, using a memory utilization profiler and hardware counters to quantify first-touch behavior, page table entry initialization, page sizes, and page migration. The study characterizes the integrated system page table in both in-memory and oversubscription scenarios and gives optimization guidance for different access patterns. The results indicate that system-allocated memory can benefit most use cases with minimal porting effort.

Abstract

Memory management across discrete CPU and GPU physical memory is traditionally achieved through either explicit GPU allocations and data copies or unified virtual memory. The Grace Hopper Superchip, for the first time, supports an integrated CPU-GPU system page table, hardware-level addressing of system-allocated memory, and a cache-coherent NVLink-C2C interconnect, offering an alternative way to build a Unified Memory system. In this work, we provide the first in-depth study of system memory management on the Grace Hopper Superchip, in both in-memory and memory oversubscription scenarios. We provide a suite of six representative applications, including the Qiskit quantum computing simulator, each implemented with system memory and with managed memory. Using our memory utilization profiler and hardware counters, we quantify and characterize the impact of the integrated CPU-GPU system page table on GPU applications. Our study focuses on the first-touch policy, page table entry initialization, page sizes, and page migration. We identify practical optimization strategies for different access patterns. Our results show that, as a new solution for unified memory, system-allocated memory can benefit most use cases with minimal porting effort.
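
The three memory strategies named in the abstract can be illustrated with a minimal CUDA sketch. This is not code from the paper: the kernel `scale` and the helpers `run_explicit`, `run_managed`, and `run_system` are hypothetical names chosen for illustration. The system-allocated variant passes a plain `malloc` pointer to the kernel, which only works on platforms where the GPU can address pageable host memory, so the sketch checks `cudaDevAttrPageableMemoryAccess` before running it.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Illustrative kernel: doubles every element in place.
__global__ void scale(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

// Abort with a message if a CUDA runtime call failed.
static void check(cudaError_t e, const char *what) {
    if (e != cudaSuccess) {
        fprintf(stderr, "%s: %s\n", what, cudaGetErrorString(e));
        exit(EXIT_FAILURE);
    }
}

// Launch the kernel on x and wait for it to finish.
static void launch(float *x, int n) {
    scale<<<(n + 255) / 256, 256>>>(x, n);
    check(cudaGetLastError(), "kernel launch");
    check(cudaDeviceSynchronize(), "cudaDeviceSynchronize");
}

static void fill(float *x, int n) { for (int i = 0; i < n; ++i) x[i] = 1.0f; }
static float sum(const float *x, int n) {
    float s = 0.0f;
    for (int i = 0; i < n; ++i) s += x[i];
    return s;
}

// Strategy 1: explicit device allocation and explicit copies.
float run_explicit(int n) {
    size_t bytes = n * sizeof(float);
    float *h = (float *)malloc(bytes);
    float *d = nullptr;
    fill(h, n);
    check(cudaMalloc(&d, bytes), "cudaMalloc");
    check(cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice), "copy to device");
    launch(d, n);
    check(cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost), "copy to host");
    float s = sum(h, n);
    cudaFree(d);
    free(h);
    return s;
}

// Strategy 2: CUDA managed memory; one pointer, driver-managed placement.
float run_managed(int n) {
    float *m = nullptr;
    check(cudaMallocManaged(&m, n * sizeof(float)), "cudaMallocManaged");
    fill(m, n);   // first touched on the CPU
    launch(m, n); // then accessed from the GPU
    float s = sum(m, n);
    cudaFree(m);
    return s;
}

// Strategy 3: system-allocated memory; a plain malloc pointer used by the GPU.
// Assumes the device reports pageable memory access (checked in main).
float run_system(int n) {
    float *p = (float *)malloc(n * sizeof(float));
    fill(p, n);
    launch(p, n);
    float s = sum(p, n);
    free(p);
    return s;
}

int main() {
    const int n = 1 << 20;
    printf("explicit copy:    sum = %.0f\n", run_explicit(n));
    printf("managed memory:   sum = %.0f\n", run_managed(n));

    int pageable = 0;
    check(cudaDeviceGetAttribute(&pageable, cudaDevAttrPageableMemoryAccess, 0),
          "cudaDeviceGetAttribute");
    if (pageable)
        printf("system-allocated: sum = %.0f\n", run_system(n));
    else
        printf("system-allocated: skipped (GPU cannot access pageable memory)\n");
    return 0;
}
```

All variants that run should report the same sum. The point of the sketch is the porting effort: moving from the explicit version to the system-allocated version removes the device allocation and both copies while leaving the kernel unchanged.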

Paper Structure

This paper contains 24 sections, 13 figures, and 2 tables.

Figures (13)

  • Figure 1: An overview architecture of the Grace Hopper platform that interconnects CPU and GPU with high-throughput cache-coherent NVLink-C2C.
  • Figure 2: A snippet of code transformation from a typical CUDA code with explicit memory copy to Unified Memory.
  • Figure 3: An overview of the relative performance of the system-allocated memory and the managed memory version, in terms of speedup, compared to the original explicit data copy version in six applications. No specific optimizations are included.
  • Figure 4: The memory usage patterns over time in hotspot using system memory (left) and CUDA managed memory (right).
  • Figure 5: The memory usage patterns over time in Qiskit Quantum Volume simulation using system memory and managed memory, respectively.
  • ...and 8 more figures