Characterizing and Understanding HGNN Training on GPUs

Dengke Han, Mingyu Yan, Xiaochun Ye, Dongrui Fan

TL;DR

This study presents a comprehensive quantification and in-depth analysis of two mainstream HGNN training scenarios, single-GPU training and multi-GPU distributed training, and reveals the performance bottlenecks in each scenario along with their underlying causes.

Abstract

Owing to their remarkable representation capabilities for heterogeneous graph data, Heterogeneous Graph Neural Networks (HGNNs) have been widely adopted in many critical real-world domains such as recommendation systems and medical analysis. Before an HGNN can be applied in practice, however, the model parameters best suited to a specific task must be found through extensive training, which is time-consuming and costly. To improve the efficiency of HGNN training, it is essential to characterize and analyze the execution semantics and patterns of the training process and thereby identify performance bottlenecks. In this study, we conduct an in-depth quantification and analysis of two mainstream HGNN training scenarios: single-GPU training and multi-GPU distributed training. Based on the characterization results, we disclose the performance bottlenecks and their underlying causes in each scenario and provide optimization guidelines from both software and hardware perspectives.

Paper Structure

This paper contains 50 sections, 14 figures, and 5 tables.

Figures (14)

  • Figure 1: Illustration of HetGs and HGNNs.
  • Figure 2: Illustration of HGNN training: (a) SGB stage; (b) Mini-batch sampling process; (c) Training process on a single computing node; (d) Distributed training process.
  • Figure 3: Time breakdown of HGNN training by phase: (a) The whole training process; (b) Forward; (c) Backward.
  • Figure 4: Time breakdown of HGNN training by kernel: (a) Forward; (b) Backward ("NONE" indicates that there are no CUDA kernels invoked here).
  • Figure 5: The roofline model for kernels under single-precision floating-point operations.
  • ...and 9 more figures