Generalizing Graph Transformers Across Diverse Graphs and Tasks via Pre-training

Yufei He; Zhenyu Hou; Yukuo Cen; Jun Hu; Feng He; Xu Cheng; Jie Tang; Bryan Hooi

TL;DR

PGT tackles the problem of generalizing graph pre-training across diverse, web-scale graphs by introducing a scalable transformer-based framework with Masked Graph Modeling objectives. It leverages Personalized PageRank sampling to form context sequences, uses a transformer encoder with two pre-training tasks (feature reconstruction and local structure reconstruction), and reuses a pre-trained decoder for feature augmentation during inference. Empirical results on public benchmarks and Tencent data show state-of-the-art performance and strong cross-graph transfer, including a dynamic extension (PGT-Dynamic) that surpasses specialized dynamic models. The work demonstrates practical scalability and broad applicability, suggesting a path toward universal graph foundation models for industrial and real-world domains.

Abstract

Graph pre-training has been concentrated on graph-level tasks involving small graphs (e.g., molecular graphs) or learning node representations on a fixed graph. Extending graph pre-trained models to web-scale graphs with billions of nodes in industrial scenarios, while avoiding negative transfer across graphs or tasks, remains a challenge. We aim to develop a general graph pre-trained model with inductive ability that can make predictions for unseen new nodes and even new graphs. In this work, we introduce a scalable transformer-based graph pre-training framework called PGT (Pre-trained Graph Transformer). Based on the masked autoencoder architecture, we design two pre-training tasks: one for reconstructing node features and the other for reconstructing local structures. Unlike the original autoencoder architecture where the pre-trained decoder is discarded, we propose a novel strategy that utilizes the decoder for feature augmentation. Our framework, tested on the publicly available ogbn-papers100M dataset with 111 million nodes and 1.6 billion edges, achieves state-of-the-art performance, showcasing scalability and efficiency. We have deployed our framework on Tencent's online game data, confirming its capability to pre-train on real-world graphs with over 540 million nodes and 12 billion edges and to generalize effectively across diverse static and dynamic downstream tasks.
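
The decoder-reuse step can be illustrated with a minimal sketch. It assumes, following the Figure 5 caption, that one forward pass through the pre-trained encoder and decoder yields reconstructed features, and that these are averaged element-wise with the original features. The names `augment_features`, `encoder`, and `decoder` are hypothetical stand-ins, not the authors' code, and the toy callables in the usage lines are placeholders for the pre-trained modules.

```python
def augment_features(x, encoder, decoder):
    """Average each original feature vector with its reconstruction.

    x: list of feature vectors (lists of floats), one per node.
    encoder, decoder: callables standing in for the pre-trained
        modules (hypothetical; any functions with matching shapes work).
    """
    # Single forward pass: encode, then decode back to feature space.
    reconstructed = decoder(encoder(x))
    # Element-wise mean of original and reconstructed features.
    return [
        [(a + b) / 2 for a, b in zip(row, rec_row)]
        for row, rec_row in zip(x, reconstructed)
    ]


# Toy stand-ins: an identity "encoder" and a "decoder" that doubles values.
toy_encoder = lambda feats: feats
toy_decoder = lambda hidden: [[2 * v for v in row] for row in hidden]

features = [[1.0, 2.0], [3.0, 4.0]]
augmented = augment_features(features, toy_encoder, toy_decoder)
```

No parameters are updated in this step, which matches the caption's description of the strategy as requiring no additional training.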

Paper Structure (29 sections, 11 equations, 6 figures, 18 tables)

Introduction
Related work
Scalable Graph Neural Networks
Graph Transformers
Pre-training on Graphs
Learning on Dynamic Graphs
Preliminaries
Graph Pre-training: Challenges
Negative Transfer in Graph Pre-training
The PGT Framework
Graph to Node Sequences
Masked Encoding
Pre-training Tasks
Decoder Reuse for Feature Augmentation
Performance on Public Benchmarks
...and 14 more sections

Figures (6)

Figure 1: Illustrations of graph pre-training in the online gaming industry.
Figure 2: Overview of PGT framework during pre-training phase. (1) Given a seed node $v_s$, we adopt the personalized PageRank (PPR) algorithm to sample a contextual node sequence $S_s$ to represent its local graph structure. (2) Subsequently, we randomly mask a subset of nodes in each sequence and feed the sequence composed of the unmasked nodes into the graph transformer encoder. The output consists of the embeddings of each unmasked node. (3) The decoding comprises two learning objectives: a) we incorporate the masked nodes into the output of the encoder and initialize them with a learnable token $[M]$. A shallow graph transformer is employed as a decoder to reconstruct the input features of the masked nodes. b) In the context of graphs, each sequence $S_s$ can be interpreted as a local neighborhood or subgraph. We adopt a simple MLP and contrastive loss aiming to make nodes within each sequence similar in the latent space while making them dissimilar to nodes from other sequences. This encourages the learned embeddings to reflect the actual connectivity patterns within local graph structures.
Figure 3: Step-by-step illustration of the PPR sampling process: (1) Original graph with seed node, (2) Computation of Personalized PageRank scores, (3) Selection of top-$k$ nodes, and (4) Final sequence construction for transformer input.
Figure 4: Comparison of PPR sampling for different tasks. Top row: Node classification task with a single seed node. Bottom row: Link prediction task with source and target nodes, showing how contexts are sampled from both endpoints.
Figure 5: Comparison of existing methods and the proposed PGT framework during the inference phase. Our contributions are: a) we use fast approximations of PPR to select informative auxiliary nodes, instead of employing layer-by-layer full neighborhood aggregation. b) We propose a lightweight feature augmentation strategy that requires no additional training. Using the pre-trained encoder and decoder to perform a single forward pass generates reconstructed features, which, when averaged with the original features, serve as the input for downstream tasks.
...and 1 more figure
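
The PPR sampling steps described in the captions for Figures 2 and 3 (score nodes relative to a seed, keep the top-$k$, build the sequence) can be sketched as follows. This is only an illustration under stated assumptions: it uses plain power iteration rather than the fast PPR approximations the paper mentions, and the adjacency-dict format, the restart probability `alpha=0.15`, the iteration count, and both function names are hypothetical choices.

```python
def personalized_pagerank(adj, seed, alpha=0.15, iters=50):
    """Approximate PPR scores for `seed` by power iteration.

    adj: dict mapping every node to a list of its neighbors
        (assumed format; each neighbor must also be a key).
    alpha: restart probability (an assumed value).
    """
    nodes = list(adj)
    scores = {v: 0.0 for v in nodes}
    scores[seed] = 1.0
    for _ in range(iters):
        nxt = {v: 0.0 for v in nodes}
        for u in nodes:
            nbrs = adj[u]
            if not nbrs:
                # Dangling node: send its walk mass back to the seed.
                nxt[seed] += (1 - alpha) * scores[u]
                continue
            share = (1 - alpha) * scores[u] / len(nbrs)
            for v in nbrs:
                nxt[v] += share
        # Restart: a fraction alpha of the total mass returns to the seed.
        nxt[seed] += alpha
        scores = nxt
    return scores


def sample_context_sequence(adj, seed, k):
    """Return [seed] followed by the k highest-scoring other nodes."""
    scores = personalized_pagerank(adj, seed)
    # Sort by descending score; break ties by node label for determinism.
    others = sorted(
        (v for v in scores if v != seed),
        key=lambda v: (-scores[v], str(v)),
    )
    return [seed] + others[:k]


# Small undirected example graph (hypothetical data).
graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4], 4: [3]}
sequence = sample_context_sequence(graph, seed=0, k=2)
```

For a link prediction task, the Figure 4 caption indicates that contexts are sampled from both endpoints; under the same assumptions, that would amount to calling `sample_context_sequence` once for the source node and once for the target node.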

Theorems & Definitions (1)

Definition 1: Negative Transfer in Graph Pre-training
