Shared Virtual Memory: Its Design and Performance Implications for Diverse Applications

Bennett Cooper, Thomas R. W. Scogland, Rong Ge

TL;DR

The paper investigates AMD's Shared Virtual Memory (SVM) and its interaction with Linux HMM, revealing a range-based memory management design that enables on-demand migrations but can cause severe thrashing when GPU memory is oversubscribed. It employs fine-grained profiling, SystemTap instrumentation, and application case studies to quantify Unified Memory (UM) management overhead, migration/eviction dynamics, and fault behavior, identifying critical bottlenecks and application-dependent patterns. A key finding is that aggressive prefetching paired with a range-based eviction policy amplifies costs under oversubscription, but SVM-aware algorithms (e.g., reordering computations) and driver adjustments can yield substantial performance gains, sometimes by orders of magnitude. The work provides practical guidance for AMD-based HPC systems and lays a foundation for future driver and algorithm optimizations, including potential extensions to newer architectures like MI300 and broader memory allocation schemes.

Abstract

Discrete GPU accelerators, while providing massive computing power for supercomputers and data centers, have their own separate memory domain. Explicit memory management across the device and host domains is tedious and error-prone. To improve programming portability and productivity, Unified Memory (UM) integrates GPU memory into the host virtual memory system, providing transparent data migration between host and GPU memory as well as GPU memory oversubscription. Nevertheless, current UM technologies cause significant performance loss for applications. With AMD GPUs increasingly being integrated into the world's leading supercomputers, it is necessary to understand their Shared Virtual Memory (SVM) and mitigate the performance impacts. In this work, we delve into the SVM design, examine its interactions with applications' data accesses at fine granularity, quantitatively analyze its performance effects on various applications, and identify the performance bottlenecks. Our research reveals that SVM employs an aggressive prefetching strategy for demand paging. This prefetching is efficient when GPU memory is not oversubscribed. However, in tandem with the eviction policy, it causes excessive thrashing and performance degradation for certain applications under oversubscription. We discuss SVM-aware algorithms and SVM design changes to mitigate the performance impacts. To the best of our knowledge, this work is the first in-depth and comprehensive study of SVM technologies.
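
To make the thrashing argument concrete, the following is a minimal toy model, not the SVM driver's actual logic. It assumes that a fault on a non-resident range migrates the whole range to the device and that, when the device is full, the least recently used range is evicted; the LRU choice, the function names, and the "zigzag" reordering are all hypothetical and chosen only for illustration.

```python
from collections import OrderedDict


def count_migrations(access_order, capacity_ranges):
    """Count whole-range migrations for a sequence of range IDs.

    Toy model: a fault on a non-resident range migrates the entire range
    to the device. If the device already holds `capacity_ranges` ranges,
    the least recently used one is evicted first (LRU is an assumption).
    """
    resident = OrderedDict()  # range id -> None, ordered oldest to newest
    migrations = 0
    for rng in access_order:
        if rng in resident:
            resident.move_to_end(rng)  # hit: refresh recency
            continue
        if len(resident) >= capacity_ranges:
            resident.popitem(last=False)  # evict the least recently used range
        resident[rng] = None
        migrations += 1
    return migrations


def cyclic_sweeps(num_ranges, passes):
    """Every pass touches ranges 0..num_ranges-1 in the same order."""
    return [r for _ in range(passes) for r in range(num_ranges)]


def zigzag_sweeps(num_ranges, passes):
    """Alternate passes run in reverse, so each pass starts on resident data."""
    order = []
    for p in range(passes):
        sweep = range(num_ranges)
        order.extend(sweep if p % 2 == 0 else reversed(sweep))
    return order


if __name__ == "__main__":
    print("fits in memory:  ", count_migrations(cyclic_sweeps(5, 4), 5))
    print("oversubscribed:  ", count_migrations(cyclic_sweeps(5, 4), 4))
    print("reordered sweeps:", count_migrations(zigzag_sweeps(5, 4), 4))
```

With five ranges and four passes, this model performs 5 migrations when all ranges fit, 20 when the device holds only four ranges (every access faults), and 8 when the same oversubscribed workload alternates sweep direction. The numbers describe only this toy model, but they show why access order can matter so much once eviction operates on whole ranges.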

Paper Structure (16 sections, 13 figures, 2 tables, 2 algorithms)

This paper contains 16 sections, 13 figures, 2 tables, 2 algorithms.

Figures (13)

  • Figure 1: SVM manages the UM space by ranges, rather than pages in host and device memory domains
  • Figure 2: Example range creation for three 1.5 GB allocations
  • Figure 3: Timeline of range migration for a serviceable fault. "Evict" only occurs if there is insufficient space for "Alloc".
  • Figure 4: SVM Architecture. The steps show the components' interactions in response to a page fault originating from a compute unit (CU).
  • Figure 5: The cost of SVM UM management and range migration. SGEMM is shown in two windows as the magnitude of the second visually erases the first.
  • ...and 8 more figures