GeoT: Tensor Centric Library for Graph Neural Network via Efficient Segment Reduction on GPU

Zhongming Yu, Genghan Zhang, Hanxian Huang, Xin Chen, Jishen Zhao

TL;DR

GeoT is a tensor-centric library for Graph Neural Networks built around efficient segment reduction on GPUs. It introduces a tunable hierarchical tiling space, a complete SR/PR reduction space, and data-aware kernel configuration via a lightweight decision-tree model, plus format-agnostic fusion that merges the message and aggregation stages. The reported gains are an average operator speedup of 1.80x and an end-to-end speedup of 1.68x, with portability across A100, H100, and RTX 3090 Ti GPUs. These results suggest that tensor-centric optimizations are viable for geometric deep learning and can be integrated with compilers and end-to-end ML systems. GeoT thus offers a practical way to accelerate GNN workloads without depending on a particular graph format.

Abstract

In recent years, Graph Neural Networks (GNNs) have ignited a surge of innovation, significantly enhancing the processing of geometric data structures such as graphs, point clouds, and meshes. As the domain continues to evolve, a series of frameworks and libraries are being developed to push GNN efficiency to new heights. While graph-centric libraries have achieved success in the past, the advent of efficient tensor compilers has highlighted the urgent need for tensor-centric libraries. Yet, efficient tensor-centric frameworks for GNNs remain scarce due to unique challenges and limitations encountered when implementing segment reduction in GNN contexts. We introduce GeoT, a cutting-edge tensor-centric library designed specifically for GNNs via efficient segment reduction. GeoT debuts innovative parallel algorithms that not only introduce new design principles but also expand the available design space. Importantly, GeoT is engineered for straightforward fusion within a computation graph, ensuring compatibility with contemporary tensor-centric machine learning frameworks and compilers. Setting a new performance benchmark, GeoT marks a considerable advancement by showcasing an average operator speedup of 1.80x and an end-to-end speedup of 1.68x.

Paper Structure

This paper contains 29 sections, 1 equation, 11 figures, 2 tables, 1 algorithm.

Figures (11)

  • Figure 1: A simplified illustration of the message-passing GNN schedule. (a) An example graph with 4 nodes and 5 edges. (b) The message stage of the GNN, corresponding to the message equation. (c) The aggregation stage, corresponding to the aggregation equation. (d) The update step, corresponding to the update equation.
  • Figure 2: An illustration of the segment reduction operation. (a) The M-loop of segment reduction. (b) The N-loop of segment reduction. (c) The reduction type $f$ used for aggregation can vary, for example mean or max, though sum is the most commonly used. (d) Pseudocode for segment reduction.
  • Figure 3: Design principles for efficient segment reduction: (a) a tunable hierarchical tiling space that uses block tiling and thread-group tiling; (b) SR and PR reduction strategies within the thread group's M-loop subtask, with PR controlled by a separate parameter $G_t$ for synchronized thread-group management; (c) a decision tree that determines the optimal segment reduction rules, from which the kernel configuration is generated automatically.
  • Figure 4: (a) GFlops heatmaps for the Amazon-Photo and Ogbn-Arxiv datasets, showing how configuration settings affect performance. Two configuration parameters, $T_N$ and $M_t$, are varied to illustrate their influence on computational efficiency. (b) Average GFlops of the SR and PR methods across augmented datasets, illustrating the potential trade-offs between the two reduction strategies.
  • Figure 5: The process flow of data-aware configuration. The key idea is to use a performance database to train more efficient decision-tree rules for segment reduction. The decision tree is translated into kernel configs through code generation, which are then compiled into the final .so library for GPUs.
  • ...and 6 more figures
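
The segment reduction described in the Figure 2 caption (an M-loop over input rows, an N-loop over feature columns, and a reduction type $f$ such as sum, mean, or max) can be sketched in plain Python as follows. This is a minimal sequential reference under stated assumptions, not GeoT's GPU kernel or its API: the function name segment_reduce, its parameters, and the choice to leave empty segments at zero are all hypothetical, and the index list is assumed to be sorted in non-decreasing order, as is usual for segment reduction.

```python
from typing import List, Sequence


def segment_reduce(
    src: Sequence[Sequence[float]],
    index: Sequence[int],
    num_segments: int,
    reduce: str = "sum",
) -> List[List[float]]:
    """Reduce rows of `src` that share the same segment id in `index`.

    Hypothetical reference helper (not GeoT's API).
    src:   M rows, each with N features.
    index: M segment ids, assumed sorted in non-decreasing order.
    Empty segments are left as zeros (an assumption of this sketch).
    """
    if reduce not in ("sum", "mean", "max"):
        raise ValueError(f"unsupported reduce type: {reduce}")
    if len(src) != len(index):
        raise ValueError("src and index must have the same length")

    n = len(src[0]) if src else 0
    out = [[0.0] * n for _ in range(num_segments)]
    counts = [0] * num_segments

    for m, seg in enumerate(index):      # M-loop: one input row at a time
        row = src[m]
        for j in range(n):               # N-loop: one feature column at a time
            if reduce == "max":
                # The first row of a segment initializes the running maximum.
                if counts[seg] == 0:
                    out[seg][j] = row[j]
                else:
                    out[seg][j] = max(out[seg][j], row[j])
            else:
                out[seg][j] += row[j]    # sum, also the numerator for mean
        counts[seg] += 1

    if reduce == "mean":
        for seg in range(num_segments):
            if counts[seg]:
                out[seg] = [v / counts[seg] for v in out[seg]]
    return out


if __name__ == "__main__":
    # Four rows with two features each; rows 0 and 1 belong to segment 0.
    features = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
    segment_ids = [0, 0, 1, 3]
    print(segment_reduce(features, segment_ids, num_segments=4))
```

In a GNN aggregation step, each row of src would be one edge message and index would hold the destination node of each edge, so the output has one aggregated row per node. A GPU implementation would parallelize both loops rather than run them sequentially as this sketch does.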