Automatic BLAS Offloading on Unified Memory Architecture: A Study on NVIDIA Grace-Hopper
Junjie Li, Yinzhi Wang, Xiao Liang, Hang Liu
TL;DR
Porting BLAS-heavy CPU codes to GPUs is hindered by data movement bottlenecks in traditional architectures. The authors introduce a drop-in tool that intercepts CPU BLAS calls and offloads them to GPU-accelerated BLAS using Grace-Hopper's cache-coherent Unified Memory Architecture, aided by three data-management strategies, including a first-touch migration scheme. Across dgemm, PARSEC, and MuST benchmarks, the approach yields substantial speedups over both CPU-only execution on Grace and NVBLAS, with Strategy 3 (automatic data migration) often delivering the best results. The work provides a practical path to exploit unified memory GPUs for scientific workloads without code changes, enabling faster exploration and porting of BLAS-heavy applications.
Abstract
Porting codes to GPUs often requires major effort. While several tools exist for automatically offloading numerical libraries such as BLAS and LAPACK, they often prove impractical due to the high cost of mandatory data transfer. The new unified memory architecture in NVIDIA Grace-Hopper allows high-bandwidth, cache-coherent access to all memory from both the CPU and the GPU, potentially eliminating the bottleneck faced in conventional architectures. This breakthrough opens up new avenues for application development and porting strategies. In this study, we introduce a new tool for automatic BLAS offload. The tool leverages the high-speed, cache-coherent NVLink C2C interconnect in Grace-Hopper and enables performant GPU offload for BLAS-heavy applications with no code changes or recompilation. The tool was tested on two quantum chemistry or physics codes, and significant performance benefits were observed.
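To make the idea of intercepting BLAS calls concrete, the following is a minimal sketch in C of a wrapper that exports the Fortran-style `dgemm_` symbol and dispatches each call to one of two backends. It is not the authors' implementation. The size cutoff `OFFLOAD_MIN_FLOPS`, the helper names `cpu_backend` and `gpu_backend`, and the call counters are all hypothetical, and both backends here are plain reference loops so that the sketch compiles without CUDA or a BLAS library. Only the no-transpose case is handled.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative counters showing which path each call took (hypothetical). */
static int cpu_calls = 0;
static int gpu_calls = 0;

/* Assumed cutoff: calls with fewer floating-point operations stay on the CPU. */
static const double OFFLOAD_MIN_FLOPS = 1.0e6;

/* Column-major reference C = alpha*A*B + beta*C for the 'N','N' case only. */
static void ref_dgemm_nn(int m, int n, int k, double alpha,
                         const double *a, int lda,
                         const double *b, int ldb,
                         double beta, double *c, int ldc)
{
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < m; ++i) {
            double acc = 0.0;
            for (int p = 0; p < k; ++p)
                acc += a[i + (size_t)p * lda] * b[p + (size_t)j * ldb];
            double *cij = &c[i + (size_t)j * ldc];
            /* When beta is zero, the old value of C is ignored. */
            *cij = alpha * acc + (beta == 0.0 ? 0.0 : beta * (*cij));
        }
    }
}

/* Stand-in for the original CPU BLAS routine. */
static void cpu_backend(int m, int n, int k, double alpha,
                        const double *a, int lda, const double *b, int ldb,
                        double beta, double *c, int ldc)
{
    ++cpu_calls;
    ref_dgemm_nn(m, n, k, alpha, a, lda, b, ldb, beta, c, ldc);
}

/* Stand-in for a GPU BLAS call; a real tool would invoke a GPU library here. */
static void gpu_backend(int m, int n, int k, double alpha,
                        const double *a, int lda, const double *b, int ldb,
                        double beta, double *c, int ldc)
{
    ++gpu_calls;
    ref_dgemm_nn(m, n, k, alpha, a, lda, b, ldb, beta, c, ldc);
}

/* Wrapper with the Fortran BLAS calling convention (all arguments by pointer).
 * An application that calls dgemm_ resolves to this function when the wrapper
 * is placed ahead of the CPU BLAS library in symbol lookup order. */
void dgemm_(const char *transa, const char *transb,
            const int *m, const int *n, const int *k,
            const double *alpha, const double *a, const int *lda,
            const double *b, const int *ldb,
            const double *beta, double *c, const int *ldc)
{
    /* The sketch supports only untransposed operands. */
    assert(*transa == 'N' && *transb == 'N');

    double flops = 2.0 * (double)(*m) * (double)(*n) * (double)(*k);
    if (flops >= OFFLOAD_MIN_FLOPS)
        gpu_backend(*m, *n, *k, *alpha, a, *lda, b, *ldb, *beta, c, *ldc);
    else
        cpu_backend(*m, *n, *k, *alpha, a, *lda, b, *ldb, *beta, c, *ldc);
}
```

One plausible way to deploy such a wrapper with no code changes or recompilation is to build it as a shared library and load it ahead of the CPU BLAS with the dynamic linker's `LD_PRELOAD` mechanism; this deployment route is an assumption of the sketch. In that setting `gpu_backend` would call a GPU routine such as `cublasDgemm`, and on a cache-coherent unified memory system the operand pointers could in principle be passed through without an explicit copy.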
