Plain Transformers Can be Powerful Graph Learners

Liheng Ma, Soumyasundar Pal, Yingxue Zhang, Philip H. S. Torr, Mark Coates

TL;DR

This paper demonstrates that plain Transformers can be powerful graph learners by introducing three minimal modifications: adaptive root-mean-square normalization (AdaRMSN), simplified $L_2$ attention (s$L_2$) that blends angle and magnitude information, and an MLP-based stem for graph positional encoding (PE), using RRWP as the graph PE. The resulting Powerful Plain Graph Transformers (PPGT) achieve strong empirical expressivity on the GD-WL framework, outperforming more complex GTs on synthetic isomorphism benchmarks and delivering top performance on multiple real-world graph datasets, including large-scale PCQM4Mv2, while preserving the simplicity of vanilla Transformers. The work also provides ablations that highlight the value of preserving token magnitude information, the benefits of the sinusoidal PE enhancement (SPE), and the viability of plain Transformer architectures for graph tasks. Overall, PPGT offers a practical path toward unifying graph learning with other modalities under plain Transformer paradigms, with potential implications for multi-modal foundation models. The main limitations are quadratic time and space complexity and the resulting scalability concerns for very large graphs, discussed further in the appendix.

Abstract

Transformers have attained outstanding performance across various modalities, owing to their simple but powerful scaled-dot-product (SDP) attention mechanisms. Researchers have attempted to migrate Transformers to graph learning, but most advanced Graph Transformers (GTs) have strayed far from plain Transformers, exhibiting major architectural differences either by integrating message-passing or incorporating sophisticated attention mechanisms. These divergences hinder the easy adoption of training advances for Transformers developed in other domains. Contrary to previous GTs, this work demonstrates that the plain Transformer architecture can be a powerful graph learner. To achieve this, we propose to incorporate three simple, minimal, and easy-to-implement modifications to the plain Transformer architecture to construct our Powerful Plain Graph Transformers (PPGT): (1) simplified $L_2$ attention for measuring the magnitude closeness among tokens; (2) adaptive root-mean-square normalization to preserve token magnitude information; and (3) a simple MLP-based stem for graph positional encoding. Consistent with its theoretical expressivity, PPGT demonstrates noteworthy realized expressivity on the empirical graph expressivity benchmark, comparing favorably to more complicated competitors such as subgraph GNNs and higher-order GNNs. Its outstanding empirical performance across various graph datasets also justifies the practical effectiveness of PPGT.
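The claim that simplified $L_2$ attention fits inside a plain Transformer can be illustrated with a small numerical check. The sketch below assumes the attention score is the negative halved squared distance, $-\tfrac{1}{2}\lVert q_i - k_j\rVert^2$, and omits any temperature or $\sqrt{d}$ scaling; the exact form used in the paper may differ. Under that assumption, expanding the square gives $q_i^\top k_j - \tfrac{1}{2}\lVert k_j\rVert^2 - \tfrac{1}{2}\lVert q_i\rVert^2$, and the last term is constant across keys, so it cancels in the softmax. What remains is ordinary dot-product attention plus a per-key additive float bias. The function names are illustrative only.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def l2_attention_weights(Q, K):
    # Assumed score: -0.5 * ||q_i - k_j||^2 (no extra scaling).
    diff = Q[:, None, :] - K[None, :, :]          # (n_q, n_k, d)
    return softmax(-0.5 * (diff ** 2).sum(-1), axis=-1)

def sdp_with_key_bias_weights(Q, K):
    # Dot-product scores plus an additive float bias of -0.5 * ||k_j||^2.
    # The -0.5 * ||q_i||^2 term is constant within each row and cancels
    # inside the softmax, so it is dropped.
    key_bias = -0.5 * (K ** 2).sum(-1)            # (n_k,)
    return softmax(Q @ K.T + key_bias[None, :], axis=-1)

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))   # 4 queries, dimension 8
K = rng.normal(size=(5, 8))   # 5 keys, dimension 8

A_l2 = l2_attention_weights(Q, K)
A_sdp = sdp_with_key_bias_weights(Q, K)
print(np.allclose(A_l2, A_sdp))
```

Because the bias is just an additive float term on the score matrix, this form can be passed to standard attention kernels that accept a float attention mask.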

Paper Structure

This paper contains 67 sections, 3 theorems, 18 equations, 6 figures, 10 tables.

Key Result

Proposition E.1

Powerful Plain Graph Transformers (PPGT) with generalized distance (GD) as graph PE are as powerful as GD-WL, when choosing proper functions $\phi$ and $\theta$ and using a sufficiently large number of heads and layers.

Figures (6)

  • Figure 1: (a) Graph Transformers usually consist of preprocessing blocks (i.e., stems), backbone blocks (i.e., Transformer layers), and task-specific output heads. (b) GraphGPS introduces a complicated hybrid architecture integrating MPNN layers with sparse edge updates. (c) GRIT is equipped with a complex attention mechanism (conditional MLPs) with PE update and degree-scaler. On the other hand, (d) the proposed PPGT blocks simply follow the plain Transformer architecture, where s$L_2$ attention is implemented as SDP attention via float attention mask, and AdaRMSN is a direct substitute of RMSN.
  • Figure 2: Illustration comparing different attention mechanisms: (b) visualization of attention scores. SDP attention is biased towards the larger-magnitude key $k_2$. Cos attention disregards the magnitude information. $L_2$ attention strikes a balance between SDP and Cos attention and attends to $k_1$, which has the lowest $L_2$ distance to the query $q$.
  • Figure 3: Illustration of two node-pairs (a) $(i,j)$ and $(i,k)$ of a graph, and (b) absolute difference of RRWPs and sinusoidally-encoded RRWPs for those two node-pairs.
  • Figure 4: Ablation Study on ZINC. MLPA: Conditional MLP Attention; DegS: degree scaler; ARMSN: AdaRMSNorm; URP: Universal RPE; SPE: Sinusoidal PE enhancement.
  • Figure 5: (Case Study of AdaRMSN) Visualization of input and predicted data points [(1) Input; (2) Predictions w/ BN; (3) Predictions w/ RMSN; (4) Predictions w/ AdaRMSN]. RMSN is ineffective in preserving magnitude information, whereas both BN and AdaRMSN successfully maintain the crucial magnitude information of the data points.
  • ...and 1 more figure
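The behavior described in the Figure 2 caption can be reproduced on a toy example. The sketch below uses made-up 2-D vectors (not the values from the figure): a query $q$, a key $k_1$ close to $q$ in $L_2$ distance, and a key $k_2$ pointing in the same direction as $q$ but with a much larger magnitude. The $L_2$ score is again assumed to be $-\tfrac{1}{2}\lVert q - k\rVert^2$, without any scaling.

```python
import numpy as np

# Hypothetical toy vectors chosen for illustration.
q = np.array([1.0, 1.0])
K = np.array([[1.0, 0.8],    # k1: closest to q in L2 distance
              [3.0, 3.0]])   # k2: same direction as q, larger magnitude

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def cosine(q, K):
    # Cosine similarity between q and each row of K.
    return (K @ q) / (np.linalg.norm(K, axis=1) * np.linalg.norm(q))

sdp_w = softmax(K @ q)                                   # dot-product scores
cos_w = softmax(cosine(q, K))                            # cosine scores
l2_w = softmax(-0.5 * ((K - q) ** 2).sum(axis=1))        # assumed L2 scores

print("SDP :", sdp_w)   # most weight on k2 (larger magnitude)
print("Cos :", cos_w)   # nearly uniform: both keys are almost aligned with q
print("L2  :", l2_w)    # most weight on k1 (smallest distance to q)

# Cosine similarity is unchanged when a key is rescaled.
print(np.allclose(cosine(q, K), cosine(q, 10.0 * K)))
```

In this example dot-product attention favors $k_2$ purely because of its norm, cosine attention cannot tell the keys apart by magnitude at all, and the distance-based score selects $k_1$.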

Theorems & Definitions (5)

  • Proposition E.1
  • Lemma E.2
  • Proof of Proposition E.1
  • Proposition E.3
  • Proof of Proposition E.3
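As a concrete view of the statement in the Figure 5 case study that RMSN is ineffective at preserving magnitude information, the sketch below shows that plain root-mean-square normalization maps a token and any positively rescaled copy of it to the same output. It uses a bare RMSN without a learnable gain and with $\epsilon = 0$ so that the invariance is exact; AdaRMSN itself is not implemented here, since its definition is not given in this summary.

```python
import numpy as np

def rms_norm(x, eps=0.0):
    # Plain RMS normalization over the feature axis, without a learnable gain.
    # eps is set to 0 here so the scale invariance holds exactly.
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return x / rms

x = np.array([[0.5, -1.0, 2.0]])      # one token with 3 features
out_small = rms_norm(x)
out_large = rms_norm(10.0 * x)        # same direction, 10x the magnitude

# Both inputs are mapped to the same normalized vector, so the original
# magnitude cannot be recovered from the output.
print(np.allclose(out_small, out_large))
```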