RDMA-Based Algorithms for Sparse Matrix Multiplication on GPUs
Benjamin Brock, Aydın Buluç, Katherine Yelick
TL;DR
This paper tackles the bottlenecks of distributed sparse matrix multiplication on GPUs by introducing asynchronous, RDMA-based SpMM and SpGEMM algorithms that operate without inner-loop synchronization. It develops dense and CSR-based remote data structures, RDMA execution models (stationary C, A, and B), and work-stealing strategies, all implemented with NVSHMEM and GPUDirect RDMA. A roofline-inspired performance model characterizes inter-node communication versus local compute, and experiments on Summit and DGX-2 show that the RDMA methods often outperform bulk-synchronous SUMMA, especially in communication-bound regimes and on imbalanced workloads. The work demonstrates the practical impact of RDMA-based asynchronous kernels for large-scale graph analytics and related data-intensive applications, and provides a framework for further optimization and integration with existing GPU-based libraries.
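To make the "stationary C" execution model concrete, the following is a minimal single-process sketch, not the paper's NVSHMEM implementation. It assumes a square grid of equally sized dense tiles. The names `remote_get`, `matmul_acc`, and `stationary_c` are hypothetical, and `remote_get` is a plain copy standing in for a one-sided RDMA get. Each (i, j) "rank" keeps its tile of C in place and pulls the A and B tiles it needs, with no barrier inside the k loop.

```python
def matmul_acc(c, a, b):
    """Accumulate c += a @ b for small dense tiles (lists of lists)."""
    for i in range(len(a)):
        for k in range(len(b)):
            aik = a[i][k]
            if aik == 0:
                continue  # skip zero entries of the A tile
            for j in range(len(b[0])):
                c[i][j] += aik * b[k][j]


def remote_get(tiles, i, j):
    """Hypothetical one-sided fetch: copy tile (i, j) from its owner."""
    return [row[:] for row in tiles[i][j]]


def stationary_c(a_tiles, b_tiles, grid, tile):
    """Each (i, j) 'rank' owns C[i][j] and fetches A[i][k] and B[k][j]."""
    c_tiles = [[[[0] * tile for _ in range(tile)] for _ in range(grid)]
               for _ in range(grid)]
    for i in range(grid):          # in a real run, each (i, j) pair is
        for j in range(grid):      # a separate process working on its own
            for k in range(grid):  # no synchronization inside this loop
                a = remote_get(a_tiles, i, k)
                b = remote_get(b_tiles, k, j)
                matmul_acc(c_tiles[i][j], a, b)
    return c_tiles


# 2x2 grid of 1x1 tiles: A = [[1, 2], [0, 3]], B = [[4, 0], [5, 6]]
a_tiles = [[[[1]], [[2]]], [[[0]], [[3]]]]
b_tiles = [[[[4]], [[0]]], [[[5]], [[6]]]]
c_tiles = stationary_c(a_tiles, b_tiles, grid=2, tile=1)
```

In the stationary A and stationary B variants, the tile that stays in place changes, so partial results for C would have to be accumulated remotely rather than locally.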
Abstract
Sparse matrix multiplication is an important kernel for large-scale graph processing and other data-intensive applications. In this paper, we implement several asynchronous, RDMA-based sparse-times-dense (SpMM) and sparse-times-sparse (SpGEMM) algorithms and evaluate their performance in a distributed-memory setting on GPUs. Our RDMA-based implementations use the NVSHMEM communication library for direct, asynchronous, one-sided communication between GPUs. We compare our asynchronous implementations to state-of-the-art bulk-synchronous GPU libraries as well as a CUDA-aware MPI implementation of the SUMMA algorithm. We find that asynchronous RDMA-based implementations offer favorable performance compared to bulk-synchronous implementations, while also allowing for the straightforward implementation of novel work-stealing algorithms.
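For readers unfamiliar with the SpMM kernel, here is a minimal pure-Python sketch of the local computation: a sparse matrix in CSR form multiplied by a dense matrix. It is only illustrative. The function name `csr_spmm` and its argument names are assumptions, and it does not reflect the GPU kernels or the distributed layout used in the paper.

```python
def csr_spmm(indptr, indices, data, dense, n_cols):
    """Multiply a CSR sparse matrix by a dense matrix (list of rows).

    indptr[i]..indptr[i+1] delimits the nonzeros of sparse row i;
    indices holds their column ids and data their values.
    """
    n_rows = len(indptr) - 1
    out = [[0.0] * n_cols for _ in range(n_rows)]
    for i in range(n_rows):
        for p in range(indptr[i], indptr[i + 1]):
            k = indices[p]   # column of the nonzero = row of the dense matrix
            v = data[p]
            for j in range(n_cols):
                out[i][j] += v * dense[k][j]
    return out


# A = [[1, 0, 2],
#      [0, 0, 3]] in CSR form
indptr = [0, 2, 3]
indices = [0, 2, 2]
data = [1.0, 2.0, 3.0]
dense = [[1, 2], [3, 4], [5, 6]]
result = csr_spmm(indptr, indices, data, dense, n_cols=2)
```

SpGEMM differs in that the second operand and the output are also sparse, so the output's sparsity pattern is not known in advance.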
