Accelerating Mini-batch HGNN Training by Reducing CUDA Kernels

Meng Wu, Jingkai Qiu, Mingyu Yan, Wenming Li, Yang Zhang, Zhimin Zhang, Xiaochun Ye, Dongrui Fan

TL;DR

This work tackles low GPU utilization in mini-batch HGNN training, which is caused by numerous short, memory-bound CUDA kernels. It introduces HiFuse, a PyG extension that reorganizes and merges vertex features so that fewer, larger kernels do the work, and that offloads edge-index selection to the CPU, using parallelization and asynchronous pipelining to balance CPU-GPU workloads. The approach yields an average speedup of $2.38\times$ over the state-of-the-art PyG framework, with significant kernel-count reductions and data-locality gains across multiple datasets and HGNN models. This CPU-GPU co-design and data-flow optimization improves the practical scalability of HGNNs on real-world heterogeneous graphs.
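The feature-merging idea can be illustrated with a toy example. The sketch below is not HiFuse's implementation: it uses plain Python lists instead of GPU tensors, and the graph, the helper names (`aggregate`, `merge_features`), and the sum aggregation are all assumptions chosen for illustration. Each call to `aggregate` stands in for one kernel launch, so merging the per-type feature matrices lets a single call replace one call per relation.

```python
def aggregate(features, edges, out):
    """One stand-in 'kernel': add each source row into its destination row."""
    for src, dst in edges:
        for k, value in enumerate(features[src]):
            out[dst][k] += value


def merge_features(features_by_type):
    """Stack per-type matrices into one matrix and record each type's row offset."""
    merged, offsets = [], {}
    for vtype, rows in features_by_type.items():
        offsets[vtype] = len(merged)
        merged.extend(rows)
    return merged, offsets


# Hypothetical toy graph: 2 authors and 2 papers, feature width 2.
features_by_type = {
    "author": [[1.0, 0.0], [0.0, 1.0]],
    "paper": [[2.0, 2.0], [3.0, 1.0]],
}
# relation -> (source vertex type, edges as (src, dst)); every dst is a paper.
relations = {
    "writes": ("author", [(0, 0), (1, 0), (1, 1)]),
    "cites": ("paper", [(1, 0)]),
}
num_targets, width = 2, 2

# Baseline: one aggregation call (one "kernel") per relation.
baseline = [[0.0] * width for _ in range(num_targets)]
baseline_calls = 0
for src_type, edges in relations.values():
    aggregate(features_by_type[src_type], edges, baseline)
    baseline_calls += 1

# Merged: remap sources to rows of the merged matrix, then make a single call.
merged, offsets = merge_features(features_by_type)
all_edges = [
    (offsets[src_type] + src, dst)
    for src_type, edges in relations.values()
    for src, dst in edges
]
fused = [[0.0] * width for _ in range(num_targets)]
aggregate(merged, all_edges, fused)
fused_calls = 1
```

Both paths produce the same aggregated features, but the merged path makes one call over a larger contiguous matrix. On a real GPU the analogous saving would be in launch overhead and memory locality.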

Abstract

Heterogeneous graph neural networks (HGNNs) are essential for capturing the structure and semantic information in heterogeneous graphs. However, existing GPU-based solutions, such as PyTorch Geometric, suffer from low GPU utilization because HGNN training launches numerous short-running, memory-bound CUDA kernels. To address this issue, we introduce HiFuse, an enhancement for PyTorch Geometric designed to accelerate mini-batch HGNN training on CPU-GPU systems. From the data perspective, we reorganize and merge multiple smaller vertex feature matrices into larger ones, enabling a single kernel to process larger data chunks. This efficiently exploits data locality, reduces kernel launch overhead, and improves overall GPU utilization. From the workflow perspective, we offload the construction of semantic graphs from the GPU to the CPU to reduce the number of CUDA kernels. To meet the parallelism requirements on the CPU and ensure seamless execution between CPU and GPU, we employ parallelization techniques including multi-threading and asynchronous pipelining. This allows different stages of the process to overlap, enhancing GPU utilization and reducing end-to-end execution latency, leading to a more efficient and balanced use of computational resources. In extensive experiments, HiFuse achieves an average speedup of 2.38 times over a state-of-the-art solution.
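The workflow side (CPU-side semantic graph construction overlapped with the downstream stage) can be sketched with the Python standard library. This is a minimal illustration under stated assumptions, not the HiFuse code: the edge format `(src, relation, dst)`, the function names, the worker count, and the queue size are all hypothetical, and the consumer only counts edges where a real system would launch GPU work. CPython threads also do not speed up pure-Python loops, so the sketch shows the structure of the pipeline rather than its performance.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor


def select_edges(edges, relation):
    """CPU stage: keep the (src, dst) pairs of one relation (one semantic graph)."""
    return [(src, dst) for src, rel, dst in edges if rel == relation]


def build_semantic_graphs(batch, relations, pool):
    """Select the edges of every relation in parallel worker threads."""
    futures = {rel: pool.submit(select_edges, batch, rel) for rel in relations}
    return {rel: fut.result() for rel, fut in futures.items()}


def producer(batches, relations, out_queue):
    """Build semantic graphs batch by batch and hand them to the next stage."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        for batch in batches:
            out_queue.put(build_semantic_graphs(batch, relations, pool))
    out_queue.put(None)  # sentinel: no more batches


def consume(in_queue):
    """Stand-in for the GPU stage: here it only counts edges per relation."""
    results = []
    while True:
        graphs = in_queue.get()
        if graphs is None:
            break
        results.append({rel: len(edges) for rel, edges in graphs.items()})
    return results


# Hypothetical mini-batches of typed edges.
batches = [
    [(0, "writes", 1), (2, "cites", 1), (3, "writes", 4)],
    [(5, "cites", 6)],
]
relations = ["writes", "cites"]

# The bounded queue lets the producer run ahead of the consumer by a few batches.
handoff = queue.Queue(maxsize=2)
worker = threading.Thread(target=producer, args=(batches, relations, handoff))
worker.start()
edge_counts = consume(handoff)
worker.join()
```

The bounded queue is the design choice that makes the two stages overlap: the producer can prepare the next batch while the consumer is still working on the current one, without running arbitrarily far ahead.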

Paper Structure (17 sections, 11 figures, 3 tables, 2 algorithms)

Figures (11)

  • Figure 1: Illustrations for heterogeneous graph and HGNN.
  • Figure 2: Workflow of mini-batch HGNN training.
  • Figure 3: Frequent launches of many short-running, memory-bound kernels in the semantic graph build and neighbor aggregation stages for the RGCN model on the AM dataset: (a) Partial timeline showing CUDA kernels' activity; (b) Roofline model of GPU FP32 performance showing CUDA kernels' execution bound.
  • Figure 4: Data organization and accesses to vertex features in neighbor aggregation: (a) Features organized by vertex index first; (b) Features organized by vertex type first.
  • Figure 5: Neighbor aggregation (a) without feature merging and (b) with feature merging.
  • ...and 6 more figures