
Hector: An Efficient Programming and Compilation Framework for Implementing Relational Graph Neural Networks in GPU Architectures

Kun Wu, Mert Hidayetoğlu, Xiang Song, Sitao Huang, Da Zheng, Israt Nisa, Wen-mei Hwu

TL;DR

Hector tackles the memory- and data-movement bottlenecks of Relational Graph Neural Networks (RGNNs) on GPUs by introducing a two-level intermediate representation and an automated code generator. The framework decouples model semantics, data layout, and operator-specific optimization, and builds its kernels from a GEMM template and a node/edge traversal template. It demonstrates up to $9.9\times$ inference and $43.7\times$ training speed-ups over state-of-the-art systems on heterogeneous graphs, with no out-of-memory (OOM) errors in those tests, and adds compact materialization and linear operator reordering for a further speed-up of up to $3.8\times$. By generating about 8K lines of CUDA/C++ code from about 51 lines of model code, Hector reduces the programming effort needed for RGNN workloads on heterogeneous graphs.

Abstract

Relational graph neural networks (RGNNs) are graph neural networks with dedicated structures for modeling the different types of nodes and edges in heterogeneous graphs. While RGNNs have been increasingly adopted in many real-world applications due to their versatility and accuracy, they pose performance and system design challenges: inherently memory-intensive computation patterns, the gap between the programming interface and kernel APIs, and heavy programming effort in optimizing kernels caused by their coupling with data layout and heterogeneity. To systematically address these challenges, we propose Hector, a novel two-level intermediate representation and its code generator framework, that (a) captures the key properties of RGNN models and the opportunities to reduce memory accesses in inter-operator scheduling and materialization, (b) generates code with a flexible data access scheme to eliminate redundant data copies, and (c) decouples model semantics, data layout, and operator-specific optimization from each other to reduce programming effort. By building on one general matrix multiply (GEMM) template and a node/edge traversal template, Hector achieves up to 9.9x speed-up in inference and 43.7x speed-up in training compared with the state-of-the-art public systems on select models, i.e., RGCN, RGAT and HGT, when running heterogeneous graphs provided by Deep Graph Library (DGL) and Open Graph Benchmark (OGB). In addition, Hector does not trigger any out-of-memory (OOM) exception in these tests. We also propose linear operator reordering and compact materialization to further accelerate the system by up to 3.8x. As an indicator of programming effort reduction, Hector takes in 51 lines of code expressing the three models and generates a total of 8K lines of CUDA and C++ code.

Paper Structure (10 sections, 1 equation, 7 figures, 2 tables)

Figures (7)

  • Figure 1: The forward propagation of an RGCN layer could be divided into ① message generation on edges and ② node aggregation. We focus on paper node $z$ in a large citation graph as an example. $z$ only has two incoming edges, from $a$ and $b$, respectively. $\overrightarrow{{h}^{(in)}}$ and $\overrightarrow{{h}^{(out)}}$ are node features. $W_{writes}$ is the weight for the type "writes". $W_{0}$ is the weight for virtual self-loops. $\sigma$ is the activation function. Notably, some runtime implementations may replicate data, e.g., $W_{writes}$.
  • Figure 2: HGT and RGAT layer. $\overrightarrow{{h}_n}$ and $\overrightarrow{{h}_n^\prime}$ are node $n$'s features. Denote the type of edge from $a$ to $z$ as $\tau(a\rightarrow z)$. Weights $W_{a,\tau(a\rightarrow z)}$ differ by edge type $\tau(a\rightarrow z)$: For example, assuming there are two edge types, "writes" and "cites", $W_{a,\text{"writes"}}$ is a different weight from $W_{a,\text{"cites"}}$. They are defined and learnt according to the edge type. $W_{m,\tau(a\rightarrow z)}$ and $\overrightarrow{{w}_{a,\tau(a\rightarrow z)}}$ are in similar situations. Weights $W_{\tau(n)}$ differ by the node type $\tau(n)$ of $n$. $\sigma$ is a leaky rectified linear unit (ReLU) in the case of RGAT. $\sigma_{sm}$ stands for edge softmax. $[\vec{s};\vec{t}]$ concatenates $\vec{s}, \vec{t}$.
  • Figure 3: Breakdown of inference time by Graphiler and Hector. Matrix multiply (MM) includes SpMM. We categorize PyTorch time not accounted for by kernels as "PyTorch Other Compute".
  • Figure 4: Inefficiency (in red) exists in all layers of existing systems.
  • Figure 5: Hector workflow and software architecture.
  • ...and 2 more figures
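
The two steps named in the Figure 1 caption, message generation on edges followed by node aggregation, can be illustrated with a small reference sketch. This is plain Python written for readability, not Hector's generated code or its API. The function names (`rgcn_layer`, `vec_mat`, `relu`), the dictionary-based graph encoding, the "cites" edge type on the edge from $b$, and all numeric values are assumptions made for illustration. The sketch also omits any per-edge normalization constant, since the caption does not mention one.

```python
def vec_mat(v, W):
    """Multiply row vector v (length d_in) by matrix W (d_in x d_out)."""
    return [sum(v[i] * W[i][j] for i in range(len(v))) for j in range(len(W[0]))]


def relu(v):
    """Element-wise ReLU, used here as a stand-in for the activation sigma."""
    return [max(x, 0.0) for x in v]


def rgcn_layer(h_in, edges, W_rel, W_self, activation=relu):
    """Hypothetical reference RGCN layer (illustrative only).

    h_in:   dict node -> input feature vector
    edges:  list of (src, dst, edge_type) tuples
    W_rel:  dict edge_type -> weight matrix (one shared copy per type)
    W_self: weight matrix for the virtual self-loop (W_0 in the caption)
    """
    d_out = len(W_self[0])
    agg = {n: [0.0] * d_out for n in h_in}
    for src, dst, rel in edges:
        # Step 1: message generation on the edge, using the weight of its type.
        msg = vec_mat(h_in[src], W_rel[rel])
        # Step 2: node aggregation, summing messages at the destination node.
        agg[dst] = [s + m for s, m in zip(agg[dst], msg)]
    h_out = {}
    for n, h in h_in.items():
        # Add the virtual self-loop term, then apply the activation.
        self_term = vec_mat(h, W_self)
        h_out[n] = activation([s + t for s, t in zip(agg[n], self_term)])
    return h_out


# Toy graph mirroring the caption: node z has incoming edges from a and b.
h_in = {"a": [1.0, 0.0], "b": [0.0, 1.0], "z": [1.0, 1.0]}
edges = [("a", "z", "writes"), ("b", "z", "cites")]
W_rel = {
    "writes": [[1.0, 2.0], [3.0, 4.0]],
    "cites": [[0.0, 1.0], [1.0, 0.0]],
}
W_self = [[1.0, 0.0], [0.0, 1.0]]

h_out = rgcn_layer(h_in, edges, W_rel, W_self)
print(h_out["z"])  # [3.0, 3.0]
```

Indexing `W_rel` by edge type keeps one copy of each weight matrix. The caption notes that some runtime implementations instead replicate data such as $W_{writes}$, for example once per edge.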