Towards Principled Graph Transformers

Luis Müller; Daniel Kusuma; Blai Bonet; Christopher Morris

TL;DR

This work shows that the recently proposed Edge Transformer, a global attention model operating on node pairs instead of nodes, has at least 3-WL expressive power, and demonstrates that the Edge Transformer surpasses other theoretically aligned architectures regarding predictive performance while not relying on positional or structural encodings.

Abstract

Graph learning architectures based on the k-dimensional Weisfeiler-Leman (k-WL) hierarchy offer a theoretically well-understood expressive power. However, such architectures often fail to deliver solid predictive performance on real-world tasks, limiting their practical impact. In contrast, global attention-based models such as graph transformers demonstrate strong performance in practice, but comparing their expressive power with the k-WL hierarchy remains challenging, particularly since these architectures rely on positional or structural encodings for their expressivity and predictive performance. To address this, we show that the recently proposed Edge Transformer, a global attention model operating on node pairs instead of nodes, has at least 3-WL expressive power. Empirically, we demonstrate that the Edge Transformer surpasses other theoretically aligned architectures regarding predictive performance while not relying on positional or structural encodings. Our code is available at https://github.com/luis-mueller/towards-principled-gts

Paper Structure (39 sections, 8 theorems, 82 equations, 4 figures, 20 tables, 1 algorithm)

Introduction
Related work
Edge Transformers
Tokenization
Efficiency
Positional/structural encodings
Readout
The expressivity of Edge Transformers
Folklore Weisfeiler--Leman
The logic of Edge Transformers
Language and configurations
Systematic generalization
Experimental evaluation
Datasets
Baselines
...and 24 more sections

Key Result

Theorem 1

The ET has exactly $3$-WL expressive power.

Figures (4)

Figure 1: Tokenization of the Edge Transformer. Given a graph $G$, we construct a 3D tensor where we embed information from each node pair into a $d$ dimensional vector.
Figure 2: Tensor operations in a single triangular attention head; see Algorithm 1 for a comparison to standard attention in pseudo-code.
Figure 3: Difference in micro F1 with and without the OOD validation technique of Jung and Ahn (2023), for Triplet-GMPNN (Ibarz et al., 2022) and ET, respectively.
Figure 4: Runtime in seconds of the forward pass of a single ET layer in PyTorch for graphs with up to 700 nodes. We compare the runtime with and without torch.compile (automatic compilation into Triton; Tillet et al., 2019) enabled. Without compilation, the ET runs out of memory beyond 600 nodes.
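
The captions for Figures 1 and 2 describe two steps: building one token per node pair, and attending over intermediate nodes with a triangular attention head. The NumPy sketch below illustrates both under stated assumptions and is not the authors' implementation. The function names, the pair featurization (features of both endpoints plus an edge flag and a self-pair flag), and the weight shapes are hypothetical. The attention rule follows the commonly stated form of triangular attention: for a pair (i, j), scores over intermediate nodes l come from the query at (i, l) and the key at (l, j), and the value is the elementwise product of two projections.

```python
import numpy as np


def tokenize_pairs(adj, node_feats):
    """Build an (n, n, d) tensor with one token per ordered node pair (i, j).

    Assumed featurization (hypothetical, for illustration only):
    [features of i, features of j, edge flag, i == j flag], so d = 2 * f + 2.
    """
    n, f = node_feats.shape
    src = np.broadcast_to(node_feats[:, None, :], (n, n, f))
    dst = np.broadcast_to(node_feats[None, :, :], (n, n, f))
    edge = adj[:, :, None].astype(float)
    diag = np.eye(n)[:, :, None]
    return np.concatenate([src, dst, edge, diag], axis=-1)


def triangular_attention(x, wq, wk, wv1, wv2, wo):
    """Single triangular attention head on pair tokens x of shape (n, n, d)."""
    d = x.shape[-1]
    q, k = x @ wq, x @ wk        # q[i, l] and k[l, j]
    v1, v2 = x @ wv1, x @ wv2    # the two value projections
    # scores[i, l, j] = <q[i, l], k[l, j]> / sqrt(d)
    scores = np.einsum("ild,ljd->ilj", q, k) / np.sqrt(d)
    # Numerically stable softmax over the intermediate node l (axis 1).
    scores -= scores.max(axis=1, keepdims=True)
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)
    # out[i, j] = sum_l attn[i, l, j] * (v1[i, l] * v2[l, j])
    out = np.einsum("ilj,ild,ljd->ijd", attn, v1, v2)
    return out @ wo


# Usage on a 3-node path graph with one-hot node features.
rng = np.random.default_rng(0)
adj = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]])
feats = np.eye(3)
x = tokenize_pairs(adj, feats)
d = x.shape[-1]
weights = [rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(5)]
out = triangular_attention(x, *weights)
```

The intermediate score tensor has one entry per node triple (i, l, j), so memory and compute grow cubically with the number of nodes in this naive form, which is consistent with the out-of-memory behavior reported in Figure 4. The layer is also equivariant to node relabeling: permuting the rows and columns of the input permutes the output in the same way.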

Theorems & Definitions (13)

Theorem 1: Informal
Theorem 2: Theorem 5.2 in Cai et al. (1992), informally
Corollary 3
Lemma 4
proof
Lemma 5
proof
Proposition 6
proof
Proposition 7
...and 3 more
