gHAWK: Local and Global Structure Encoding for Scalable Training of Graph Neural Networks on Knowledge Graphs
Humera Sabir, Fatima Farooq, Ashraf Aboulnaga
TL;DR
gHAWK targets the scalability bottleneck of training GNNs on large, heterogeneous knowledge graphs by shifting structural modeling from iterative message passing to a preprocessing step. It precomputes local neighborhood information with Bloom filters and global structure via TransE embeddings, then fuses these priors with domain features to form robust, fixed node representations that feed into any GNN backbone. The approach yields consistent accuracy gains and faster convergence across node property prediction and link prediction benchmarks, achieving state-of-the-art results on several Open Graph Benchmark graphs while reducing memory and training time. By enabling effective use of shallow, scalable GNNs and flexible decoders, gHAWK offers a practical pathway for web-scale KG reasoning without prohibitive parameter overheads.
Abstract
Knowledge Graphs (KGs) are a rich source of structured, heterogeneous data, powering a wide range of applications. A common approach to leverage this data is to train a graph neural network (GNN) on the KG. However, existing message-passing GNNs struggle to scale to large KGs because they rely on the iterative message passing process to learn the graph structure, which is inefficient, especially under mini-batch training, where a node sees only a partial view of its neighborhood. In this paper, we address this problem and present gHAWK, a novel and scalable GNN training framework for large KGs. The key idea is to precompute structural features for each node that capture its local and global structure before GNN training even begins. Specifically, gHAWK introduces a preprocessing step that computes: (a) Bloom filters to compactly encode local neighborhood structure, and (b) TransE embeddings to represent each node's global position in the graph. These features are then fused with any domain-specific features (e.g., text embeddings), producing a node feature vector that can be incorporated into any GNN technique. By augmenting message-passing training with structural priors, gHAWK significantly reduces memory usage, accelerates convergence, and improves model accuracy. Extensive experiments on large datasets from the Open Graph Benchmark (OGB) demonstrate that gHAWK achieves state-of-the-art accuracy and lower training time on both node property prediction and link prediction tasks, topping the OGB leaderboard for three graphs.
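To make the preprocessing idea concrete, the following is a minimal sketch of how a per-node feature vector of this kind could be assembled. It is an illustration under stated assumptions, not the authors' implementation: the function names (`bloom_encode`, `transe_score`, `fuse_features`), the filter size, the number of hash functions, the SHA-256-based hashing, and fusion by plain concatenation are all hypothetical choices made here. The TransE vectors are random stand-ins for embeddings that would be trained separately, and the abstract does not specify whether the Bloom filter also encodes relation types.

```python
import hashlib
import numpy as np

def bloom_encode(neighbor_ids, num_bits=64, num_hashes=3):
    """Encode a set of neighbor IDs as a fixed-length Bloom-filter bit vector.

    num_bits and num_hashes are illustrative values, not taken from the paper.
    """
    bits = np.zeros(num_bits, dtype=np.float32)
    for nid in neighbor_ids:
        for seed in range(num_hashes):
            # Derive num_hashes independent positions by salting one hash function.
            digest = hashlib.sha256(f"{seed}:{nid}".encode()).digest()
            idx = int.from_bytes(digest[:8], "big") % num_bits
            bits[idx] = 1.0
    return bits

def transe_score(head, relation, tail):
    """Standard TransE plausibility: negative L2 distance between head + relation and tail."""
    return -float(np.linalg.norm(head + relation - tail))

def fuse_features(bloom_bits, transe_vec, domain_vec):
    """Fuse structural priors with domain features (assumed here: concatenation)."""
    return np.concatenate([bloom_bits, transe_vec, domain_vec])

# Toy graph with three nodes: node 0 is connected to nodes 1 and 2.
neighbors = {0: [1, 2], 1: [0], 2: [0]}
rng = np.random.default_rng(0)
transe = rng.normal(size=(3, 8)).astype(np.float32)   # stand-in for trained TransE embeddings
domain = rng.normal(size=(3, 4)).astype(np.float32)   # stand-in for e.g. text embeddings

# Fixed node features, computed once before any GNN training.
features = np.stack([
    fuse_features(bloom_encode(neighbors[v]), transe[v], domain[v])
    for v in range(3)
])
```

In this sketch each node ends up with a 64 + 8 + 4 = 76-dimensional input vector that is computed once and can be passed to any GNN backbone in place of the raw domain features. Nodes with identical neighbor sets (nodes 1 and 2 above) receive identical Bloom-filter segments, which is how the bit vector carries local structure independently of mini-batch sampling.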
