Task-Oriented GNNs Training on Large Knowledge Graphs for Accurate and Efficient Modeling

Hussein Abdallah, Waleed Afandi, Panos Kalnis, Essam Mansour

TL;DR

This work tackles the high resource demands of training heterogeneous GNNs on large knowledge graphs by introducing KG-TOSA, an automated method to extract a task-oriented subgraph (TOSG) that preserves local and global task-relevant structure. The authors define a generic graph pattern that maximizes neighbor-type entropy while minimizing distance to target nodes, and propose three extraction approaches: biased random walk, influence-based sampling, and a SPARQL-based method that leverages RDF engine indices. Across a large benchmark of KGs and six state-of-the-art GNNs, KG-TOSA achieves up to 70% reductions in training time and memory with concurrent improvements in accuracy, demonstrating faster convergence and more efficient model deployment on big KGs. The SPARQL-based extraction, in particular, offers negligible preprocessing overhead while delivering comparable or better performance than sampling-based methods. Overall, KG-TOSA provides a scalable, practical pathway to deploy HGNNs in real-world, large-scale KG scenarios, with significant implications for data-intensive AI pipelines.

Abstract

A Knowledge Graph (KG) is a heterogeneous graph encompassing a diverse range of node and edge types. Heterogeneous Graph Neural Networks (HGNNs) are popular for training machine learning tasks like node classification and link prediction on KGs. However, HGNN methods exhibit excessive complexity influenced by the KG's size, density, and the number of node and edge types. AI practitioners handcraft a subgraph of a KG G relevant to a specific task. We refer to this subgraph as a task-oriented subgraph (TOSG), which contains a subset of task-related node and edge types in G. Training the task using the TOSG instead of G alleviates the excessive computation required for a large KG. Crafting the TOSG demands a deep understanding of the KG's structure and the task's objectives, so it is challenging and time-consuming. This paper proposes KG-TOSA, an approach to automate the TOSG extraction for task-oriented HGNN training on a large KG. In KG-TOSA, we define a generic graph pattern that captures the KG's local and global structure relevant to a specific task. We explore different techniques to extract subgraphs matching our graph pattern: (i) two techniques that sample around targeted nodes using biased random walk or influence scores, and (ii) a SPARQL-based extraction method that leverages RDF engines' built-in indices. The SPARQL-based method therefore incurs negligible preprocessing overhead compared to the sampling techniques. We develop a benchmark of large real KGs and various tasks for node classification and link prediction. Our experiments show that KG-TOSA helps state-of-the-art HGNN methods reduce training time and memory usage by up to 70% while improving model performance, e.g., accuracy and inference time.
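
To make the SPARQL-based extraction more concrete, the sketch below builds a one-hop query around target vertices of a given type. It is an illustration only, not the authors' query: the function name `build_tosg_query`, the `http://example.org/Paper` IRI, and the reading of the direction parameter (1 = outgoing predicates only, 2 = outgoing and incoming) are assumptions made for this example.

```python
def build_tosg_query(target_type_iri: str, directions: int = 1) -> str:
    """Build a one-hop SPARQL SELECT query around vertices of one type.

    Hypothetical helper: directions=1 keeps triples leaving a target
    vertex; directions=2 also keeps triples arriving at a target vertex.
    """
    if directions not in (1, 2):
        raise ValueError("directions must be 1 (outgoing) or 2 (both)")

    # Outgoing pattern: the subject is a target vertex.
    blocks = ["{ ?s a <%s> . ?s ?p ?o . }" % target_type_iri]
    if directions == 2:
        # Incoming pattern: the object is a target vertex.
        blocks.append("{ ?o a <%s> . ?s ?p ?o . }" % target_type_iri)

    # UNION merges the matches of both patterns into one result set.
    return "SELECT ?s ?p ?o WHERE { %s }" % " UNION ".join(blocks)


# Example: both directions around vertices typed as a (hypothetical) Paper class.
query = build_tosg_query("http://example.org/Paper", directions=2)
```

The resulting string could be sent to any SPARQL endpoint; because both patterns start from a typed vertex, an RDF engine can typically answer them from its built-in indices rather than scanning the whole graph.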

Paper Structure

This paper contains 27 sections, 3 equations, 9 figures, 4 tables, 3 algorithms.

Figures (9)

  • Figure 1: (A) Accuracy (higher is better), (B) training time (lower is better), (C) training memory (lower is better). Training a node classification task ($PV$) to predict the paper venue using ShaDowSAINT and SeHGNN on a MAG graph with 42M vertices (MAG-42M). The handcrafted task-oriented subgraph (OGBN-MAG) from MAG-42M trades accuracy for reduced time and memory usage. Our KG-TOSA$_{d1h1}$ task-oriented subgraph is extracted automatically from MAG-42M for $PV$; it reduces time and memory consumption while improving accuracy.
  • Figure 2: Examples of subgraphs generated by the uniform random walk (URW) sampler in GraphSAINT for different GNN tasks. The black vertices indicate the target vertices. The same colour means the same vertex or edge type. This sampling method guarantees neither (i) sufficient representation of the target vertices ($\mathcal{V}_T$) nor (ii) the exclusion of vertices disconnected from $\mathcal{V}_T$, which do not contribute to the embeddings of $\mathcal{V}_T$.
  • Figure 3: The TOSG's generic graph pattern is based on two parameters: (i) the direction of the predicates (outgoing and incoming), and (ii) the number of hops.
  • Figure 4: The KG-TOSA generic workflow. The TOSG ($KG'$) of Task $\mathcal{A}$ is extracted and transformed into an adjacency matrix. Then, the HGNN training performs either mini-batch or full-batch training. The size of $KG'$ is proportional to $\mathcal{V}_T$, the average degree of $\mathcal{V}_T$, and the average distance to $\mathcal{V}_T$. The smaller $KG'$, the faster the training pipeline.
  • Figure 5: The subgraphs (a, b, and c) are generated by our biased random walk sampler. The black vertices indicate the target vertices. Node and edge types are colour-coded. Our approach yields a higher ratio of target vertices than the random walk sampler in Figure 2, while the non-target vertices of different node/edge types are each connected to at least one vertex in $\mathcal{V}_T$.
  • ...and 4 more figures
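
The two parameters of the generic graph pattern in Figure 3 (predicate direction and number of hops) can be illustrated with a small in-memory sketch. This is not the paper's algorithm: the function `extract_subgraph`, the toy triples, and the reading of the direction parameter (1 = outgoing only, 2 = outgoing and incoming) are assumptions for illustration.

```python
from collections import defaultdict


def extract_subgraph(triples, targets, hops=1, directions=1):
    """Collect triples within `hops` of the target vertices.

    Hypothetical helper: directions=1 follows outgoing edges only;
    directions=2 follows outgoing and incoming edges.
    """
    out_edges = defaultdict(list)  # subject -> triples leaving it
    in_edges = defaultdict(list)   # object  -> triples arriving at it
    for s, p, o in triples:
        out_edges[s].append((s, p, o))
        in_edges[o].append((s, p, o))

    frontier = set(targets)
    visited = set(targets)
    kept = set()
    for _ in range(hops):
        next_frontier = set()
        for v in frontier:
            for s, p, o in out_edges.get(v, ()):
                kept.add((s, p, o))
                if o not in visited:
                    next_frontier.add(o)
            if directions == 2:
                for s, p, o in in_edges.get(v, ()):
                    kept.add((s, p, o))
                    if s not in visited:
                        next_frontier.add(s)
        visited |= next_frontier
        frontier = next_frontier
    return kept


# Toy KG: one target paper, its venue, an author, and an unrelated edge.
toy_triples = [
    ("p1", "venue", "v1"),
    ("a1", "writes", "p1"),
    ("v1", "locatedIn", "c1"),
    ("x", "rel", "y"),
]
d1h1 = extract_subgraph(toy_triples, {"p1"}, hops=1, directions=1)
d2h1 = extract_subgraph(toy_triples, {"p1"}, hops=1, directions=2)
```

In this toy graph the `("x", "rel", "y")` triple is never kept, because it is disconnected from the target vertex; widening the direction or the hop count enlarges the subgraph only along paths that reach a target.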

Theorems & Definitions (4)

  • Definition 2.1: Knowledge Graph
  • Definition 2.2: Node Classification
  • Definition 2.3: Link Prediction
  • Definition 3.1: Task-oriented Subgraph for Training HGNNs