Towards Principled Graph Transformers

Luis Müller, Daniel Kusuma, Blai Bonet, Christopher Morris

TL;DR

This work shows that the recently proposed Edge Transformer, a global attention model operating on node pairs instead of nodes, has at least 3-WL expressive power, and demonstrates that the Edge Transformer surpasses other theoretically aligned architectures regarding predictive performance while not relying on positional or structural encodings.

Abstract

Graph learning architectures based on the k-dimensional Weisfeiler-Leman (k-WL) hierarchy offer a theoretically well-understood expressive power. However, such architectures often fail to deliver solid predictive performance on real-world tasks, limiting their practical impact. In contrast, global attention-based models such as graph transformers demonstrate strong performance in practice, but comparing their expressive power with the k-WL hierarchy remains challenging, particularly since these architectures rely on positional or structural encodings for their expressivity and predictive performance. To address this, we show that the recently proposed Edge Transformer, a global attention model operating on node pairs instead of nodes, has at least 3-WL expressive power. Empirically, we demonstrate that the Edge Transformer surpasses other theoretically aligned architectures regarding predictive performance while not relying on positional or structural encodings. Our code is available at https://github.com/luis-mueller/towards-principled-gts

Paper Structure

This paper contains 39 sections, 8 theorems, 82 equations, 4 figures, 20 tables, 1 algorithm.

Key Result

Theorem 1

The ET has exactly $3$-WL expressive power.

Figures (4)

  • Figure 1: Tokenization of the Edge Transformer. Given a graph $G$, we construct a 3D tensor in which the information from each node pair is embedded into a $d$-dimensional vector.
  • Figure 2: Tensor operations in a single triangular attention head; see Algorithm 1 for a pseudo-code comparison with standard attention.
  • Figure 3: Difference in micro F1 with and without the OOD validation technique of Jung and Ahn (2023), for Triplet-GMPNN (Ibarz et al., 2022) and ET, respectively.
  • Figure 4: Runtime in seconds of the forward pass of a single ET layer in PyTorch for graphs with up to 700 nodes. We compare the runtime with and without torch.compile (automatic compilation into Triton; Tillet et al., 2019) enabled. Without compilation, the ET runs out of memory beyond 600 nodes.
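
The captions of Figures 1 and 2 describe two steps: building one token per node pair, and applying triangular attention over those tokens. The following is a minimal NumPy sketch of both, not the authors' implementation. It assumes the triangular attention formulation commonly given for the Edge Transformer, in which the pair (i, j) attends over intermediate nodes l and combines the states of (i, l) and (l, j) elementwise. The function names, the weight names (`Wq`, `Wk`, `Wv1`, `Wv2`, `Wo`), and the choice of pair features are hypothetical and chosen only for illustration.

```python
import numpy as np

def tokenize_pairs(adj, node_feat):
    """Build an (n, n, d) tensor with one token per ordered node pair.

    Assumed feature layout (illustrative only):
    [features of i, features of j, edge indicator, i == j indicator].
    """
    n = adj.shape[0]
    fi = np.repeat(node_feat[:, None, :], n, axis=1)  # features of i at (i, j)
    fj = np.repeat(node_feat[None, :, :], n, axis=0)  # features of j at (i, j)
    edge = adj[:, :, None].astype(float)              # 1.0 if (i, j) is an edge
    diag = np.eye(n)[:, :, None]                      # 1.0 if i == j
    return np.concatenate([fi, fj, edge, diag], axis=-1)

def triangular_attention(X, Wq, Wk, Wv1, Wv2, Wo):
    """Single-head triangular attention over pair tokens X of shape (n, n, d_in)."""
    d = Wq.shape[1]
    Q, K = X @ Wq, X @ Wk        # queries read from (i, l), keys from (l, j)
    V1, V2 = X @ Wv1, X @ Wv2    # two value projections, combined elementwise
    # scores[i, l, j] = <Q[i, l], K[l, j]> / sqrt(d)
    scores = np.einsum("ild,ljd->ilj", Q, K) / np.sqrt(d)
    scores -= scores.max(axis=1, keepdims=True)       # numerical stability
    A = np.exp(scores)
    A /= A.sum(axis=1, keepdims=True)                 # softmax over l
    # out[i, j] = sum_l A[i, l, j] * (V1[i, l] * V2[l, j])
    out = np.einsum("ilj,ild,ljd->ijd", A, V1, V2)
    return out @ Wo
```

Materializing the `(n, n, n)` score tensor gives this sketch cubic memory and time in the number of nodes, which is consistent with the out-of-memory behaviour reported in the Figure 4 caption. Because the sum runs over all intermediate nodes l, relabelling the nodes of the input tensor relabels the output in the same way.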

Theorems & Definitions (13)

  • Theorem 1: Informal
  • Theorem 2: Theorem 5.2 of Cai et al. (1992), informally
  • Corollary 3
  • Lemma 4
  • proof
  • Lemma 5
  • proof
  • Proposition 6
  • proof
  • Proposition 7
  • ...and 3 more