A Pure Transformer Pretraining Framework on Text-attributed Graphs

Yu Song; Haitao Mao; Jiachen Xiao; Jingzhe Liu; Zhikai Chen; Wei Jin; Carl Yang; Jiliang Tang; Hui Liu

TL;DR

This work tackles cross-graph transfer by shifting from structure-centric pretraining to a feature-centric paradigm built on unified text-based node representations produced by LLMs. It introduces GSPT, which samples node contexts with random walks and trains a standard Transformer on masked feature reconstruction with a cosine-similarity objective, enabling transfer across graphs within the same domain. Pretraining on a large-scale graph (ogbn-papers100M) and downstream evaluation on Cora, Citeseer, Pubmed, and Arxiv23 show that GSPT enables strong in-context learning and superior transfer, particularly when class descriptions are provided. The findings highlight the potential of LLM-unified feature spaces for scalable graph foundation models and offer practical guidance for building transferable graph representations without heavy dependence on graph structure.

Abstract

Pretraining plays a pivotal role in acquiring generalized knowledge from large-scale data, achieving remarkable successes as evidenced by large models in CV and NLP. However, progress in the graph domain remains limited due to fundamental challenges such as feature heterogeneity and structural heterogeneity. Recently, increasing efforts have been made to enhance node feature quality with Large Language Models (LLMs) on text-attributed graphs (TAGs), demonstrating superiority to traditional bag-of-words or word2vec techniques. These high-quality node features reduce the previously critical role of graph structure, resulting in a modest performance gap between Graph Neural Networks (GNNs) and structure-agnostic Multi-Layer Perceptrons (MLPs). Motivated by this, we introduce a feature-centric pretraining perspective by treating graph structure as a prior and leveraging the rich, unified feature space to learn refined interaction patterns that generalize across graphs. Our framework, Graph Sequence Pretraining with Transformer (GSPT), samples node contexts through random walks and employs masked feature reconstruction to capture pairwise proximity in the LLM-unified feature space using a standard Transformer. By utilizing unified text representations rather than varying structures, our framework achieves significantly better transferability among graphs within the same domain. GSPT can be easily adapted to both node classification and link prediction, demonstrating promising empirical success on various datasets.
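
The pretraining recipe described in the abstract (random-walk contexts plus masked feature reconstruction scored by cosine similarity) can be illustrated with a minimal, standard-library sketch. Everything concrete here is an assumption made for illustration: the toy graph and 2-dimensional features, the helper names, the mask ratio, the loss form `1 - cosine`, and above all the stand-in predictor, which simply averages the unmasked features where the actual framework would use a Transformer. Negative sampling is not shown.

```python
import math
import random

def random_walk(adj, start, length, rng):
    """Sample a node sequence of up to `length` nodes by a uniform random walk."""
    walk = [start]
    while len(walk) < length:
        neighbors = adj[walk[-1]]
        if not neighbors:  # dead end: stop the walk early
            break
        walk.append(rng.choice(neighbors))
    return walk

def mask_positions(seq_len, mask_ratio, rng):
    """Pick at least one sequence position whose feature is hidden ([MASK])."""
    k = max(1, int(round(seq_len * mask_ratio)))
    return sorted(rng.sample(range(seq_len), k))

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def reconstruction_loss(predicted, target, masked):
    """Mean (1 - cosine similarity) over masked positions only (assumed loss form)."""
    return sum(1.0 - cosine(predicted[i], target[i]) for i in masked) / len(masked)

def mean_of_unmasked(target, masked):
    """Stand-in for the Transformer: predict the mean of the visible features."""
    visible = [f for i, f in enumerate(target) if i not in masked]
    dim = len(target[0])
    mean = [sum(f[d] for f in visible) / len(visible) for d in range(dim)]
    return [mean if i in masked else f for i, f in enumerate(target)]

# Hypothetical toy graph (adjacency lists) and toy "text embedding" features.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
features = {0: [1.0, 0.0], 1: [0.8, 0.2], 2: [0.5, 0.5], 3: [0.0, 1.0]}

rng = random.Random(0)
walk = random_walk(adj, start=0, length=5, rng=rng)
target = [features[n] for n in walk]
masked = mask_positions(len(walk), mask_ratio=0.2, rng=rng)
predicted = mean_of_unmasked(target, masked)
loss = reconstruction_loss(predicted, target, masked)
print(walk, masked, round(loss, 4))
```

In a real training loop the stand-in predictor would be replaced by a model whose parameters are updated to reduce this loss; the sketch only shows how contexts, masks, and the cosine-based score fit together.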

Paper Structure (28 sections, 9 equations, 5 figures, 11 tables)

Introduction
Preliminary Study
Method
An Overview
Context Construction
Transformer as Backbone
Masked Feature Reconstruction
Negative sampling
Enabling In-context Learning
Experiment
Datasets
Few-shot Node Classification
Experimental Setup
Result Comparison
Analysis of GSPT's in-context capability
...and 13 more sections

Figures (5)

Figure 1: (a) SentenceBert provides a unified feature space for different datasets under the same domain. The node features of three small citation graphs, i.e., Cora, Citeseer, and Pubmed, can be well covered by ogbn-papers100M, a large-scale citation network containing papers from a vast variety of research topics. (b) Advanced text embeddings are better at predicting the missing edges compared with shallow features. (c) Advanced embeddings reduce the performance gap on node classification between GCN and MLP. Experiments are conducted on Cora.
Figure 2: The overall framework of Graph Sequence Pretraining with Transformer (GSPT). Left: the pretraining consists of four steps: (1) generate node sequences from the graph using random walks; (2) randomly replace a portion of node features with [MASK]; (3) feed the input sequence into the Transformer; and (4) compute the feature reconstruction loss with cosine similarity. Right: We construct the augmented graph by adding class nodes to the original graph and connecting corresponding node pairs. GSPT performs in-context node classification by comparing the cosine similarity between the representations of regular nodes and class nodes.
Figure 3: Attention map on classes of Cora. Left: attention weights obtained by the pretrained Transformer. Right: attention weights w/o pretraining.
Figure 4: Ablation studies of different negative sampling strategies.
Figure 5: Scaling effect of GSPT. (a) Node classification performance on downstream datasets with linear probing. (b) Link prediction performance on downstream datasets via fine-tuning. The x-axis denotes the number of METIS graphs used for pretraining. Empirically, GSPT improves as more data is added to pretraining.
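
The in-context classification step described in the Figure 2 caption (comparing cosine similarity between regular-node and class-node representations) can be sketched as follows. This is only an illustration under stated assumptions: the 3-dimensional vectors, the label names, and the `classify_in_context` helper are hypothetical, and in the actual framework the representations would come from the pretrained Transformer applied to the augmented graph.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def classify_in_context(node_repr, class_reprs):
    """Return the label whose class-node representation is most similar."""
    return max(class_reprs, key=lambda label: cosine(node_repr, class_reprs[label]))

# Hypothetical representations of class nodes and regular nodes.
class_reprs = {
    "ml": [0.9, 0.1, 0.0],
    "db": [0.1, 0.9, 0.1],
    "theory": [0.0, 0.2, 0.9],
}
node_reprs = {
    "paper_a": [0.8, 0.3, 0.1],
    "paper_b": [0.1, 0.1, 0.7],
}

predictions = {
    name: classify_in_context(vec, class_reprs) for name, vec in node_reprs.items()
}
print(predictions)
```

No classifier head is trained in this sketch: the prediction is a nearest-class lookup under cosine similarity.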
