Plain Transformers Can be Powerful Graph Learners
Liheng Ma, Soumyasundar Pal, Yingxue Zhang, Philip H. S. Torr, Mark Coates
TL;DR
This paper shows that plain Transformers can be powerful graph learners given three minimal modifications: adaptive root-mean-square normalization (AdaRMSN), simplified $L_2$ attention (s$L_2$) that blends angle and magnitude information, and an MLP-based stem for graph positional encoding (PE), with RRWP as the graph PE. The resulting Powerful Plain Graph Transformers (PPGT) achieve strong empirical expressivity under the GD-WL framework, outperforming more complex GTs on synthetic isomorphism benchmarks and delivering top performance on multiple real-world graph datasets, including the large-scale PCQM4Mv2, while preserving the simplicity of vanilla Transformers. The work also provides ablations that highlight the value of preserving token magnitude information, the benefits of SPE, and the viability of plain Transformer architectures for graph tasks. Overall, PPGT offers a practical path toward unifying graph learning with other modalities under the plain Transformer paradigm, with potential implications for multi-modal foundation models. The main limitations are quadratic time and space complexity and the resulting scalability concerns for very large graphs, discussed further in the appendix.
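As a rough illustration of the RRWP positional encoding named above, the sketch below stacks powers of the random-walk matrix $M = D^{-1}A$, which is how relative random walk probabilities are commonly defined. The function name `rrwp`, the `num_steps` parameter, and the handling of isolated nodes are assumptions made for this example, not details taken from the paper.

```python
import numpy as np

def rrwp(adj: np.ndarray, num_steps: int) -> np.ndarray:
    """Stack I, M, M^2, ..., M^(num_steps-1) into an (n, n, num_steps) tensor.

    M = D^{-1} A is the row-normalized adjacency (random-walk) matrix.
    Hypothetical helper for illustration only.
    """
    adj = np.asarray(adj, dtype=float)
    deg = adj.sum(axis=1, keepdims=True)
    # Assumption: isolated nodes (degree 0) get an all-zero transition row.
    walk = np.divide(adj, deg, out=np.zeros_like(adj), where=deg > 0)
    powers = [np.eye(adj.shape[0])]
    for _ in range(num_steps - 1):
        powers.append(powers[-1] @ walk)
    return np.stack(powers, axis=-1)

# Path graph 0 - 1 - 2.
adj = np.array([[0, 1, 0],
                [1, 0, 1],
                [0, 1, 0]])
pe = rrwp(adj, num_steps=3)
# pe[i, j] is a length-3 vector: the probability of reaching j from i
# in 0, 1, and 2 random-walk steps.
```

Each pair of nodes receives a vector of walk probabilities, so the encoding is pairwise rather than per node; in a PPGT-style model such a tensor would presumably be fed to the MLP-based stem.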
Abstract
Transformers have attained outstanding performance across various modalities, owing to their simple but powerful scaled-dot-product (SDP) attention mechanisms. Researchers have attempted to migrate Transformers to graph learning, but most advanced Graph Transformers (GTs) have strayed far from plain Transformers, exhibiting major architectural differences either by integrating message-passing or incorporating sophisticated attention mechanisms. These divergences hinder the easy adoption of training advances for Transformers developed in other domains. Contrary to previous GTs, this work demonstrates that the plain Transformer architecture can be a powerful graph learner. To achieve this, we propose to incorporate three simple, minimal, and easy-to-implement modifications to the plain Transformer architecture to construct our Powerful Plain Graph Transformers (PPGT): (1) simplified $L_2$ attention for measuring the magnitude closeness among tokens; (2) adaptive root-mean-square normalization to preserve token magnitude information; and (3) a simple MLP-based stem for graph positional encoding. Consistent with its theoretical expressivity, PPGT demonstrates noteworthy realized expressivity on the empirical graph expressivity benchmark, comparing favorably to more complicated competitors such as subgraph GNNs and higher-order GNNs. Its outstanding empirical performance across various graph datasets also justifies the practical effectiveness of PPGT.
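To make the idea of an $L_2$-based attention score concrete, the sketch below compares attention weights computed from negative squared Euclidean distances with a simplified form that keeps only the dot product and a key-norm penalty. The two agree because $-\tfrac{1}{2}\lVert q - k\rVert^2 = q^\top k - \tfrac{1}{2}\lVert q\rVert^2 - \tfrac{1}{2}\lVert k\rVert^2$ and the $\lVert q\rVert^2$ term is constant across keys, so the softmax cancels it. This is a generic illustration under assumed conventions (no temperature or scaling factor, hypothetical function names); it should not be read as the paper's exact s$L_2$ definition.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def l2_attention_weights(q: np.ndarray, k: np.ndarray) -> np.ndarray:
    """Weights from the full score -0.5 * ||q_i - k_j||^2 (illustrative)."""
    diff = q[:, None, :] - k[None, :, :]          # (n_q, n_k, d)
    return softmax(-0.5 * (diff ** 2).sum(axis=-1))

def simplified_l2_attention_weights(q: np.ndarray, k: np.ndarray) -> np.ndarray:
    """Weights from q_i . k_j - 0.5 * ||k_j||^2, dropping the ||q_i||^2 term."""
    key_penalty = 0.5 * (k ** 2).sum(axis=-1)     # (n_k,)
    return softmax(q @ k.T - key_penalty[None, :])

rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))   # 4 query tokens, dimension 8
k = rng.normal(size=(5, 8))   # 5 key tokens, dimension 8

full = l2_attention_weights(q, k)
simple = simplified_l2_attention_weights(q, k)
same = np.allclose(full, simple)  # the two formulations give identical weights
```

The dot-product term captures the angle between query and key, while the key-norm term makes the score sensitive to token magnitude, which plain scaled-dot-product attention combined with magnitude-discarding normalization would otherwise lose.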
