Shared Virtual Memory: Its Design and Performance Implications for Diverse Applications
Bennett Cooper, Thomas R. W. Scogland, Rong Ge
TL;DR
The paper investigates AMD's Shared Virtual Memory (SVM) and its interaction with Linux HMM, revealing a range-based memory management design that enables on-demand migrations but can cause severe thrashing when GPU memory is oversubscribed. It employs fine-grained profiling, SystemTap instrumentation, and application case studies to quantify Unified Memory (UM) management overhead, migration/eviction dynamics, and fault behavior, identifying critical bottlenecks and application-dependent patterns. A key finding is that aggressive prefetching paired with a range-based eviction policy amplifies costs under oversubscription, but SVM-aware algorithms (e.g., reordering computations) and driver adjustments can yield substantial performance gains, sometimes by orders of magnitude. The work provides practical guidance for AMD-based HPC systems and lays a foundation for future driver and algorithm optimizations, including potential extensions to newer architectures such as MI300 and to broader memory allocation schemes.
Abstract
Discrete GPU accelerators, while providing massive computing power for supercomputers and data centers, have their own separate memory domain. Explicitly managing memory across the device and host domains is tedious and error-prone for programmers. To improve programming portability and productivity, Unified Memory (UM) integrates GPU memory into the host virtual memory system and provides both transparent data migration between host and GPU memory and GPU memory oversubscription. Nevertheless, current UM technologies cause significant performance loss for applications. With AMD GPUs increasingly being integrated into the world's leading supercomputers, it is necessary to understand their Shared Virtual Memory (SVM) and mitigate its performance impacts. In this work, we delve into the SVM design, examine its interactions with applications' data accesses at fine granularity, quantitatively analyze its performance effects on various applications, and identify the performance bottlenecks. Our research reveals that SVM employs an aggressive prefetching strategy for demand paging. This prefetching is efficient when GPU memory is not oversubscribed. However, in tandem with the eviction policy, it causes excessive thrashing and performance degradation for certain applications under oversubscription. We discuss SVM-aware algorithms and SVM design changes to mitigate the performance impacts. To the best of our knowledge, this work is the first in-depth and comprehensive study of SVM technologies.
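
The interaction between range-granularity migration and eviction under oversubscription can be illustrated with a toy model. The sketch below is not the AMD driver's policy: it assumes that a fault migrates one whole range, that GPU memory holds a fixed number of ranges, and that the least recently used range is evicted first. The names `count_migrations`, `sweep_order`, and `blocked_order` are hypothetical and exist only for this illustration.

```python
from collections import OrderedDict


def count_migrations(accesses, capacity_ranges):
    """Count range migrations for a sequence of range IDs.

    Toy assumptions: each miss migrates one whole range to the GPU, and the
    least recently used resident range is evicted when memory is full.
    """
    resident = OrderedDict()  # range ID -> None, ordered oldest to newest
    migrations = 0
    for r in accesses:
        if r in resident:
            resident.move_to_end(r)  # hit: refresh recency, no migration
            continue
        migrations += 1  # miss: the range must be migrated in
        if len(resident) >= capacity_ranges:
            resident.popitem(last=False)  # evict the least recently used range
        resident[r] = None
    return migrations


def sweep_order(num_ranges, passes):
    """Touch every range once per pass, repeated for several passes."""
    return [r for _ in range(passes) for r in range(num_ranges)]


def blocked_order(num_ranges, passes):
    """Reordered variant: finish all passes over one range before moving on."""
    return [r for r in range(num_ranges) for _ in range(passes)]


if __name__ == "__main__":
    num_ranges, passes = 8, 4
    for capacity in (8, 6):  # 8 ranges fit; 6 ranges means oversubscription
        print(
            f"capacity={capacity}",
            "sweep:", count_migrations(sweep_order(num_ranges, passes), capacity),
            "blocked:", count_migrations(blocked_order(num_ranges, passes), capacity),
        )
```

In this model, when all ranges fit, both orders migrate each range once. Once the working set exceeds capacity, the repeated sweep misses on every access because each range is evicted shortly before it is needed again, while the reordered variant still migrates each range only once. Such reordering applies only when the computation's data dependences allow it.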
