HC-SpMM: Accelerating Sparse Matrix-Matrix Multiplication for Graphs with Hybrid GPU Cores

Zhonggen Li, Xiangyu Ke, Yifan Zhu, Yunjun Gao, Yaofeng Tu

TL;DR

HC-SpMM is a pioneering SpMM kernel that uses hybrid GPU cores (CUDA cores and Tensor cores) to accelerate sparse–dense matrix multiplication on graphs. It partitions the adjacency matrix into row windows and uses a lightweight logistic-regression-based selector to assign each submatrix to the more suitable core type. Kernel fusion and LOA layout optimization further improve data reuse and memory access. Across 14 real-world graphs, the approach delivers an average SpMM speedup of 1.33× and an average GNN training speedup of 1.23×, outperforming state-of-the-art CUDA-only and Tensor-core methods. These gains make large-scale GNN training more practical by combining heterogeneous GPU cores with an optimized data layout and kernel design.

Abstract

Sparse Matrix-Matrix Multiplication (SpMM) is a fundamental operation in graph computing and analytics. However, the irregularity of real-world graphs makes it challenging to perform SpMM efficiently on GPUs. Recent advances in GPU computing power and the introduction of new, efficient computing cores within GPUs offer new opportunities for acceleration. In this paper, we present HC-SpMM, a pioneering algorithm that leverages hybrid GPU cores (Tensor cores and CUDA cores) to accelerate SpMM for graphs. To adapt to the computing characteristics of different GPU cores, we investigate the impact of sparse graph features on the performance of different cores, develop a data partitioning technique for the graph adjacency matrix, and devise a novel strategy for selecting the most efficient cores for processing each submatrix. We further optimize the kernel for memory access and thread utilization so that it makes full use of the available computational resources. To support complex graph computing workloads, we integrate HC-SpMM into the GNN training pipeline. Furthermore, we propose a kernel fusion strategy to enhance data reuse, as well as a cost-effective graph layout reorganization method that mitigates the irregularity and sparsity of real-world graphs and better fits the computational models of hybrid GPU cores. Extensive experiments on 14 real-world graph datasets demonstrate that HC-SpMM achieves average speedups of 1.33x and 1.23x over state-of-the-art SpMM kernels and GNN frameworks, respectively.
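
The partition-then-select idea described above can be illustrated with a small CPU-side sketch. This is not the authors' implementation: the function names, the window height, the two features (window sparsity and number of non-zero columns), and the logistic-regression weights are all assumptions made for illustration. A real selector would be fitted on measured kernel timings and would dispatch to actual CUDA-core and Tensor-core kernels.

```python
import math


def row_windows(indptr, indices, window_height):
    """Split a CSR matrix into consecutive row windows.

    Yields (start_row, end_row, nnz, nonzero_cols) for each window.
    """
    num_rows = len(indptr) - 1
    for start in range(0, num_rows, window_height):
        end = min(start + window_height, num_rows)
        cols = indices[indptr[start]:indptr[end]]
        yield start, end, len(cols), len(set(cols))


def window_sparsity(rows, nnz, nonzero_cols):
    """Fraction of zeros in the window after dropping all-zero columns."""
    if nonzero_cols == 0:
        return 1.0
    return 1.0 - nnz / (rows * nonzero_cols)


def select_core(sparsity, nonzero_cols, weights=(-4.0, 0.05), bias=1.0):
    """Logistic-regression selector with HYPOTHETICAL weights.

    The weights encode the assumption that denser windows suit Tensor
    cores and sparser windows suit CUDA cores; they are not from the paper.
    """
    z = weights[0] * sparsity + weights[1] * nonzero_cols + bias
    p_tensor = 1.0 / (1.0 + math.exp(-z))
    return "tensor" if p_tensor >= 0.5 else "cuda"


def assign_cores(indptr, indices, window_height=16):
    """Return a list of (start_row, end_row, core) assignments."""
    plan = []
    for start, end, nnz, nonzero_cols in row_windows(indptr, indices, window_height):
        sparsity = window_sparsity(end - start, nnz, nonzero_cols)
        plan.append((start, end, select_core(sparsity, nonzero_cols)))
    return plan


# Toy 4-row adjacency matrix in CSR form: two dense rows, then two sparse rows.
indptr = [0, 4, 8, 9, 10]
indices = [0, 1, 2, 3, 0, 1, 2, 3, 5, 9]
plan = assign_cores(indptr, indices, window_height=2)
```

With these made-up weights, the dense first window is routed to Tensor cores and the sparse second window to CUDA cores. Only the per-window decision structure is meant to carry over to a real implementation.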

Paper Structure

This paper contains 34 sections, 6 equations, 17 figures, 16 tables, and 6 algorithms.

Figures (17)

  • Figure 1: SpMM execution time with varying sparsity and non-zero columns.
  • Figure 2: SpMM computing procedure of CUDA cores and Tensor cores.
  • Figure 3: An example of GNN.
  • Figure 4: Combination strategies for SpMM on CUDA and Tensor cores.
  • Figure 5: Warp allocation strategies on different GPU cores.
  • ...and 12 more figures