RDMA-Based Algorithms for Sparse Matrix Multiplication on GPUs

Benjamin Brock; Aydın Buluç; Katherine Yelick

TL;DR

This paper tackles the bottlenecks of distributed sparse matrix multiplication on GPUs by introducing asynchronous, RDMA-based SpMM and SpGEMM algorithms that operate without inner-loop synchronization. It develops dense and CSR-based remote data structures, RDMA execution models (stationary C/A/B), and work-stealing strategies, all implemented with NVSHMEM and GPUDirect RDMA. A roofline-inspired performance model characterizes inter-node communication versus local compute, and extensive experiments on Summit and DGX-2 show RDMA methods often outperform bulk-synchronous SUMMA, especially for communication-bound regimes and imbalanced workloads. The work demonstrates the practical impact of RDMA-based asynchronous kernels for large-scale graph analytics and related data-intensive applications, and provides a framework for further optimization and integration with existing GPU-based libraries.

Abstract

Sparse matrix multiplication is an important kernel for large-scale graph processing and other data-intensive applications. In this paper, we implement various asynchronous, RDMA-based sparse times dense (SpMM) and sparse times sparse (SpGEMM) algorithms, evaluating their performance running in a distributed memory setting on GPUs. Our RDMA-based implementations use the NVSHMEM communication library for direct, asynchronous one-sided communication between GPUs. We compare our asynchronous implementations to state-of-the-art bulk synchronous GPU libraries as well as a CUDA-aware MPI implementation of the SUMMA algorithm. We find that asynchronous RDMA-based implementations are able to offer favorable performance compared to bulk synchronous implementations, while also allowing for the straightforward implementation of novel work stealing algorithms.
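To make the idea of an asynchronous, stationary-C SpMM concrete, the following is a minimal single-process sketch in plain Python. It is not the authors' implementation: ordinary lists stand in for GPU memory, the hypothetical `remote_get` helper stands in for a one-sided read (which the paper performs with NVSHMEM), and the function names, the dictionary-based sparse tile format, and the `(i + j + step) % t` iteration offset are assumptions chosen for illustration. Each "process" `(i, j)` owns one tile of C, pulls the A and B tiles it needs, and accumulates locally, with no synchronization inside the loop over `k`.

```python
# Illustrative simulation only: plain Python stands in for GPUs and one-sided RDMA.

def make_tiles(dense, t):
    """Split a square dense matrix (list of lists) into a t x t grid of tiles."""
    b = len(dense) // t  # tile edge length; assumes t divides the matrix size
    return [[[row[j * b:(j + 1) * b] for row in dense[i * b:(i + 1) * b]]
             for j in range(t)] for i in range(t)]

def to_sparse(tile):
    """Store a tile as {(row, col): value}, keeping only the nonzeros."""
    return {(r, c): v
            for r, row in enumerate(tile)
            for c, v in enumerate(row) if v != 0}

def remote_get(grid, i, j):
    """Hypothetical stand-in for a one-sided read of tile (i, j)."""
    return grid[i][j]

def stationary_c(a_tiles, b_tiles, t, b):
    """Each simulated process (i, j) owns C[i][j] and pulls the tiles it needs."""
    c_tiles = [[[[0] * b for _ in range(b)] for _ in range(t)] for _ in range(t)]
    for i in range(t):
        for j in range(t):
            local_c = c_tiles[i][j]  # C tile stays put on its owner
            for step in range(t):
                # Assumed offset so processes start on different tile owners.
                k = (i + j + step) % t
                a = remote_get(a_tiles, i, k)    # sparse tile of A
                bt = remote_get(b_tiles, k, j)   # dense tile of B
                for (r, c), v in a.items():
                    for col in range(b):
                        local_c[r][col] += v * bt[c][col]
    return c_tiles

def assemble(tiles):
    """Stitch a grid of dense tiles back into one matrix (list of lists)."""
    return [sum((tile[r] for tile in tile_row), [])
            for tile_row in tiles
            for r in range(len(tile_row[0]))]
```

To use the sketch, split A and B with `make_tiles`, convert each A tile with `to_sparse`, call `stationary_c`, and pass the result to `assemble`. A bulk-synchronous SUMMA would instead broadcast the tiles for a given `k` to all processes in lockstep, so every process waits at each step.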

Paper Structure (26 sections, 5 figures, 2 tables, 3 algorithms)

Introduction
Background
Distributed Matrix Algorithms
Bulk Synchronous SUMMA
RDMA and Asynchrony
RDMA-Based Algorithms
Data Structures
Reading Tiles
Modifying Remote Tiles
Algorithms
RDMA Stationary C Algorithm
RDMA Stationary A and B Algorithms
Optimizations
Workstealing Algorithms
Random workstealing
...and 11 more sections

Figures (5)

Figure 1: Total (end-to-end) vs. per-stage load balance when multiplying an R-MAT model-generated sparse matrix with a sparse 2D algorithm. Simulated on a $16 \times 16$ process grid.
Figure 2: Inter-node roofline plots for SpMM and SpGEMM with a 2D distribution. SpMM plot models performance for different widths of the dense B matrix at a fixed scale (24 GPUs), while SpGEMM models performance at different scales. Dashed horizontal lines represent local roofline peaks for SpMM and SpGEMM operations, while vertical lines represent inter-node roofline peaks for particular problems.
Figure 3: Single-node runtimes for SpMM, with different numbers of columns $N$ in the dense matrix B.
Figure 4: Multi-node runtimes for SpMM, with different numbers of columns $N$ in the dense matrix B.
Figure 5: SpGEMM strong scaling experiments.

Theorems & Definitions (1)

proof
