gHAWK: Local and Global Structure Encoding for Scalable Training of Graph Neural Networks on Knowledge Graphs

Humera Sabir, Fatima Farooq, Ashraf Aboulnaga

TL;DR

gHAWK targets the scalability bottleneck of training GNNs on large, heterogeneous knowledge graphs by shifting structural modeling from iterative message passing to a preprocessing step. It precomputes local neighborhood information with Bloom filters and global structure via TransE embeddings, then fuses these priors with domain features to form robust, fixed node representations that feed into any GNN backbone. The approach yields consistent accuracy gains and faster convergence across node property prediction and link prediction benchmarks, achieving state-of-the-art results on several Open Graph Benchmark graphs while reducing memory and training time. By enabling effective use of shallow, scalable GNNs and flexible decoders, gHAWK offers a practical pathway for web-scale KG reasoning without prohibitive parameter overheads.

Abstract

Knowledge Graphs (KGs) are a rich source of structured, heterogeneous data, powering a wide range of applications. A common approach to leverage this data is to train a graph neural network (GNN) on the KG. However, existing message-passing GNNs struggle to scale to large KGs because they rely on the iterative message-passing process to learn the graph structure, which is inefficient, especially under mini-batch training, where a node sees only a partial view of its neighborhood. In this paper, we address this problem and present gHAWK, a novel and scalable GNN training framework for large KGs. The key idea is to precompute structural features for each node that capture its local and global structure before GNN training even begins. Specifically, gHAWK introduces a preprocessing step that computes: (a) Bloom filters to compactly encode local neighborhood structure, and (b) TransE embeddings to represent each node's global position in the graph. These features are then fused with any domain-specific features (e.g., text embeddings), producing a node feature vector that can be incorporated into any GNN technique. By augmenting message-passing training with structural priors, gHAWK significantly reduces memory usage, accelerates convergence, and improves model accuracy. Extensive experiments on large datasets from the Open Graph Benchmark (OGB) demonstrate that gHAWK achieves state-of-the-art accuracy and lower training time on both node property prediction and link prediction tasks, topping the OGB leaderboard for three graphs.
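To make the two precomputed features in (a) and (b) concrete, the following is a minimal sketch rather than the authors' implementation. It builds an $m$-bit Bloom-filter signature for a node's 1-hop neighbors and evaluates the TransE translation distance for one triple. The function names, the SHA-256-based hashing scheme, the choice to hash neighbor IDs only (ignoring relation types), and the values of `m` and `k` are all assumptions made for illustration.

```python
import hashlib
import math

def _positions(node_id, m, k):
    """Yield k bit positions for node_id (hash scheme is an assumption)."""
    for seed in range(k):
        digest = hashlib.sha256(f"{seed}:{node_id}".encode()).digest()
        yield int.from_bytes(digest[:8], "big") % m

def bloom_signature(neighbor_ids, m=64, k=3):
    """Encode a set of 1-hop neighbor IDs as an m-bit Bloom filter."""
    bits = [0] * m
    for nid in neighbor_ids:
        for pos in _positions(nid, m, k):
            bits[pos] = 1
    return bits

def maybe_contains(bits, node_id, k=3):
    """True if node_id may be in the set; false positives are possible,
    false negatives are not."""
    return all(bits[pos] for pos in _positions(node_id, len(bits), k))

def transe_distance(h, r, t):
    """L2 distance ||h + r - t||; small values mean the triple fits well."""
    return math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))

# Hypothetical node with five 1-hop neighbors (IDs are made up).
neighbors = [2, 3, 4, 5, 6]
sig = bloom_signature(neighbors, m=64, k=3)
print(all(maybe_contains(sig, n) for n in neighbors))  # True

# A triple where h + r equals t exactly has distance zero.
print(transe_distance([1.0, 0.0], [0.0, 1.0], [1.0, 1.0]))  # 0.0
```

The signature has a fixed length regardless of node degree, which is what makes it usable as a fixed-size input feature. The TransE embeddings themselves would be trained separately; only the scoring function is shown here.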

Paper Structure

This paper contains 57 sections, 28 equations, 4 figures, 6 tables, 1 algorithm.

Figures (4)

  • Figure 1: gHAWK pipeline. (a) Input knowledge graph. (b) Node feature computation: in a preprocessing step, gHAWK constructs a Bloom filter encoding the 1-hop neighbors of each node and trains a TransE embedding model to capture the global graph structure. (c) Feature fusion: for every node, the Bloom filter, TransE embedding, and any domain-specific feature vector are fused in the preprocessing step via an MLP, creating a dense fused feature vector. (d) Mini-batch message passing: the fused vectors initialize a message-passing GNN that is trained on sampled mini-batch subgraphs of the knowledge graph. (e) Task-specific heads: the resulting node embeddings are fed into separate heads for node property prediction, where embeddings directly predict node labels via a supervised loss, and for link prediction, where node embeddings are combined with relation embeddings in a decoder to score candidate triples.
  • Figure 2: Local and global view in gHAWK. Left: a target node $n_1$ and its 1-hop neighbors $\{n_2,\dots,n_6\}$ with relation types $\{r_1,r_2,r_3\}$. The neighbors are hashed $k$ times into a fixed-length $m$-bit Bloom filter, producing a signature that preserves the neighborhood. Right: the same node shown as a head node $h$ in a two-dimensional projection of the TransE embedding space. The relation vector $\mathbf{r}$ is a translation from $h$ to tail $t$, illustrating $\mathbf{h}+\mathbf{r} \approx \mathbf{t}$. The node embedding $\mathbf{h}$ encodes the position of the node within the graph's global structure and captures the node's semantic context.
  • Figure 3: Feature-fusion module in gHAWK. For each node $i$, the frozen Bloom filter $B[i]$, TransE embedding $\mathbf{e}_i$, and optional domain-specific feature vector $\mathbf{x}_i$ are passed through dedicated two-layer MLPs $g_B$, $g_E$, and $g_X$ into a common $d$-dimensional space, concatenated into $\mathbf{z}_i$, and mapped by $\mathbf{W}_{\text{in}}$ and a nonlinear activation function $\phi(\cdot)$ into the initial GNN input embedding $\mathbf{h}^{(0)}_i$. All parameters in the projection MLPs and fusion layer are learnable and are updated jointly with the GNN and decoder during training.
  • Figure 4: Model convergence on MAG240M. The plots show validation error versus training time (hours) for each GNN backbone. Solid curves represent gHAWK+Text (Bloom+TransE+RoBERTa) and dotted curves represent Text (RoBERTa). X markers denote the final validation error, with the numeric label showing its value.
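The feature-fusion step described in the Figure 3 caption can be sketched as a forward pass in NumPy. This is an illustrative sketch under stated assumptions, not the paper's code: the hidden activation and $\phi$ are both taken to be ReLU, the output width of $\mathbf{W}_{\text{in}}$ is set to $d$, the weights are randomly initialized rather than trained, and all dimensions and helper names (`init_mlp`, `mlp2`, `fuse`) are hypothetical.

```python
import numpy as np

def init_mlp(rng, d_in, d_hidden, d_out):
    """Random weights for a two-layer MLP (illustrative initialization)."""
    return (
        rng.normal(0.0, 0.1, (d_in, d_hidden)), np.zeros(d_hidden),
        rng.normal(0.0, 0.1, (d_hidden, d_out)), np.zeros(d_out),
    )

def mlp2(x, W1, b1, W2, b2):
    """Two-layer MLP with a ReLU hidden activation (assumed)."""
    hidden = np.maximum(x @ W1 + b1, 0.0)
    return hidden @ W2 + b2

def fuse(B, E, X, g_B, g_E, g_X, W_in):
    """Project each feature to d dims, concatenate into z, then apply
    W_in followed by phi (ReLU assumed) to obtain h^(0)."""
    z = np.concatenate(
        [mlp2(B, *g_B), mlp2(E, *g_E), mlp2(X, *g_X)], axis=1
    )
    return np.maximum(z @ W_in, 0.0)

rng = np.random.default_rng(0)
n, m, d_e, d_x, d = 4, 64, 16, 32, 8  # hypothetical sizes

B = rng.integers(0, 2, size=(n, m)).astype(np.float64)  # Bloom filter bits
E = rng.normal(size=(n, d_e))                           # TransE embeddings
X = rng.normal(size=(n, d_x))                           # domain features

g_B = init_mlp(rng, m, 2 * d, d)
g_E = init_mlp(rng, d_e, 2 * d, d)
g_X = init_mlp(rng, d_x, 2 * d, d)
W_in = rng.normal(0.0, 0.1, (3 * d, d))

H0 = fuse(B, E, X, g_B, g_E, g_X, W_in)
print(H0.shape)  # (4, 8)
```

In training, the projection and fusion weights would be updated jointly with the GNN and decoder, as the caption states, while the Bloom filter inputs stay frozen. An autodiff framework would replace the plain NumPy arrays used here.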