Accelerating Sparse Graph Neural Networks with Tensor Core Optimization

Ka Wai Wu

TL;DR

FTC-GNN is a framework in which CUDA Cores and Tensor Cores cooperate to accelerate sparse GNNs on GPUs. It applies a sparse graph transformation that converts irregular sparse graphs into dense blocks, together with a set of kernels for sparse neighbor aggregation and edge feature computation. Tensor Cores execute the dense matrix multiplications (GEMM), while CUDA Cores handle memory and data management. Experiments with GCN and AGNN across multiple datasets show substantial speedups over DGL and PyG and performance competitive with TC-GNN, indicating improved GPU resource utilization and throughput. The work argues that Tensor Core-driven acceleration is practical for large-scale sparse graphs and outlines future directions in data storage optimization, broader model support, and hardware-aware enhancements.

Abstract

Graph neural networks (GNNs) are widely applied in domains such as social networks, bioinformatics, and recommendation systems. However, the irregularity and sparsity of graph data challenge traditional computing methods, which cannot meet the performance demands of GNNs. Recent research has explored parallel acceleration using CUDA Cores and Tensor Cores, but significant challenges persist: (1) kernel fusion produces misleadingly high utilization because it does not treat CUDA Cores and Tensor Cores as independent resources, and (2) the two heterogeneous core types have distinct computation preferences, which causes inefficiencies when work is poorly matched to them. To address these issues, this paper proposes FTC-GNN, a novel acceleration framework that efficiently utilizes CUDA Cores and Tensor Cores for GNN computation. FTC-GNN introduces (1) a collaborative design that enables the parallel utilization of CUDA Cores and Tensor Cores and (2) a sparse-to-dense transformation strategy that assigns dense matrix operations to Tensor Cores while leveraging CUDA Cores for data management and sparse edge processing. This design improves GPU resource utilization and computational efficiency. Experiments with GCN and AGNN models across various datasets demonstrate the effectiveness of FTC-GNN. For GCN, FTC-GNN achieves speedups of 4.90x, 7.10x, and 1.17x over DGL, PyG, and TC-GNN, respectively. For AGNN, it achieves speedups of 5.32x, 2.92x, and 1.02x over the same three baselines.
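
The sparse-to-dense idea described in the abstract can be illustrated with a small CPU-only sketch. This is not the authors' implementation: the function names (`csr_from_edges`, `condense_row_window`, `aggregate`), the row-window size, and the unweighted sum aggregation are all assumptions chosen for illustration. The sketch groups rows of a CSR adjacency matrix into fixed-height windows, keeps only the columns that actually contain an edge in each window, and multiplies the resulting small dense tile by the gathered neighbor features. On a GPU, the gather step is the kind of irregular data management suited to CUDA Cores, and the dense tile product is the kind of GEMM suited to Tensor Cores.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def csr_from_edges(num_nodes: int, edges: List[Tuple[int, int]]) -> Tuple[List[int], List[int]]:
    """Build CSR arrays (row_ptr, col_idx) from (src, dst) edges; duplicates are dropped."""
    rows = [set() for _ in range(num_nodes)]
    for src, dst in edges:
        rows[src].add(dst)
    row_ptr, col_idx = [0], []
    for neighbors in rows:
        col_idx.extend(sorted(neighbors))
        row_ptr.append(len(col_idx))
    return row_ptr, col_idx


def condense_row_window(row_ptr: List[int], col_idx: List[int],
                        start_row: int, window: int) -> Tuple[List[int], Matrix]:
    """Condense rows [start_row, start_row + window) into one dense tile.

    Returns the sorted list of columns that hold at least one edge in the
    window, plus a dense 0/1 tile whose j-th column maps to unique_cols[j].
    """
    end_row = min(start_row + window, len(row_ptr) - 1)
    unique_cols = sorted({c for r in range(start_row, end_row)
                          for c in col_idx[row_ptr[r]:row_ptr[r + 1]]})
    position = {c: j for j, c in enumerate(unique_cols)}
    tile = [[0.0] * len(unique_cols) for _ in range(end_row - start_row)]
    for r in range(start_row, end_row):
        for c in col_idx[row_ptr[r]:row_ptr[r + 1]]:
            tile[r - start_row][position[c]] = 1.0
    return unique_cols, tile


def aggregate(row_ptr: List[int], col_idx: List[int],
              features: Matrix, window: int = 4) -> Matrix:
    """Sum-aggregate neighbor features, one condensed dense tile at a time."""
    num_nodes, dim = len(row_ptr) - 1, len(features[0])
    out = [[0.0] * dim for _ in range(num_nodes)]
    for start in range(0, num_nodes, window):
        cols, tile = condense_row_window(row_ptr, col_idx, start, window)
        gathered = [features[c] for c in cols]   # irregular gather step
        for i, tile_row in enumerate(tile):      # small dense matrix product
            for j, a in enumerate(tile_row):
                for k in range(dim):
                    out[start + i][k] += a * gathered[j][k]
    return out
```

For a 6-node graph with 7 edges and a window of 4 rows, the first window touches only 4 distinct columns, so a 4x6 sparse slice shrinks to a 4x4 dense tile. A real kernel would additionally pad tiles to the fixed fragment shapes that Tensor Cores require, which this sketch omits.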

Paper Structure

This paper contains 33 sections, 2 equations, 15 figures, 10 tables.

Figures (15)

  • Figure 1: Computational Process of GNN
  • Figure 2: Active Scheduling of Tensor Cores and CUDA Cores
  • Figure 3: Dense Block Representation
  • Figure 4: GraphSAGE Sampling Illustration
  • Figure 5: CSR Representation of a Matrix
  • ...and 10 more figures