Aligning Transformers with Weisfeiler-Leman

Luis Müller; Christopher Morris

TL;DR

This work tackles the expressivity gap in graph learning by aligning pure transformers with the $k$-dimensional Weisfeiler--Leman ($k$-WL) hierarchy, aiming for higher-order expressive power without prohibitive computational cost. It introduces a theory-driven transformer framework, including the $1$-GT, a scalable $k$-GT, and the $(k,s)$-GT, and shows that with adjacency-identifying node encodings such as Laplacian PEs (LPE) and SPE, pure transformers can emulate the $k$-WL. The authors validate the approach with large-scale pre-training on PCQM4Mv2 and fine-tuning on molecular datasets, reporting competitive predictive performance and strong transfer to small downstream tasks, and they demonstrate expressivity gains on targeted benchmarks such as BREC. Order transfer further makes higher-order expressivity available for downstream tasks while reusing lower-order pre-trained weights, which makes higher-order transformers feasible in practice. Overall, the work provides a principled path to more expressive, scalable pure-transformer graph models with gains on real-world datasets.

Abstract

Graph neural network architectures aligned with the $k$-dimensional Weisfeiler--Leman ($k$-WL) hierarchy offer theoretically well-understood expressive power. However, these architectures often fail to deliver state-of-the-art predictive performance on real-world graphs, limiting their practical utility. While recent works aligning graph transformer architectures with the $k$-WL hierarchy have shown promising empirical results, employing transformers for higher orders of $k$ remains challenging due to a prohibitive runtime and memory complexity of self-attention as well as impractical architectural assumptions, such as an infeasible number of attention heads. Here, we advance the alignment of transformers with the $k$-WL hierarchy, showing stronger expressivity results for each $k$, making them more feasible in practice. In addition, we develop a theoretical framework that allows the study of established positional encodings such as Laplacian PEs and SPE. We evaluate our transformers on the large-scale PCQM4Mv2 dataset, showing competitive predictive performance with the state-of-the-art and demonstrating strong downstream performance when fine-tuning them on small-scale molecular datasets. Our code is available at https://github.com/luis-mueller/wl-transformers.

Paper Structure (44 sections, 30 theorems, 324 equations, 2 figures, 6 tables)

Introduction
Present work
Related work
Background
Expressive power of transformers on graphs
Transformers with $1$-WL expressive power
Transformers with $k$-WL expressive power
Implementation details
Node and adjacency-identifying PEs
LPE
SPE
Order transfer
Experimental evaluation
Pre-training
Fine-tuning
...and 29 more sections

Key Result

Theorem 2

Let $G = (V(G), E(G), \ell)$ be a labeled graph with $n$ nodes and $\mathbf{F} \in \mathbb{R}^{n \times d}$ be a node feature matrix consistent with $\ell$. Further, let $C^1_t \colon V(G) \rightarrow \mathbb{N}$ denote the coloring function of the $1$-WL at iteration $t$. Then, for all iterations [...] for all nodes $v, w \in V(G)$.
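
For readers unfamiliar with the coloring function $C^1_t$ referenced in the theorem, the following is a minimal sketch of standard $1$-WL color refinement. It is not taken from the paper's code: the function name `wl1_colors` and the adjacency-dictionary input format are assumptions made for illustration. Colors are re-indexed to small integers at every iteration, so only the induced partition of the nodes is meaningful.

```python
def wl1_colors(adj, labels, iterations):
    """Sketch of 1-WL color refinement (hypothetical helper, not the paper's code).

    adj:        dict mapping each node to a list of its neighbors
    labels:     dict mapping each node to a hashable initial label
    iterations: number of refinement rounds t
    Returns a dict mapping each node to an integer color after t rounds.
    """
    # Iteration 0: colors are the (compressed) initial node labels.
    palette = {lab: i for i, lab in enumerate(sorted(set(labels.values()), key=repr))}
    colors = {v: palette[labels[v]] for v in adj}

    for _ in range(iterations):
        # New color = old color together with the multiset of neighbor colors.
        signatures = {
            v: (colors[v], tuple(sorted(colors[u] for u in adj[v])))
            for v in adj
        }
        # Map each distinct signature to a fresh integer color.
        palette = {sig: i for i, sig in enumerate(sorted(set(signatures.values())))}
        colors = {v: palette[signatures[v]] for v in adj}
    return colors


# Path graph 0 - 1 - 2 - 3 with identical initial labels.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
uniform = {v: "a" for v in path}
print(wl1_colors(path, uniform, iterations=2))  # {0: 0, 1: 1, 2: 1, 3: 0}
```

On the path graph, the two endpoints share one color and the two inner nodes share another, since the $1$-WL separates them by their neighborhoods. On regular graphs with uniform labels (for example, a 6-cycle next to two disjoint triangles), every node keeps the same color, which is the classical limitation that motivates higher-order $k$-WL.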

Figures (2)

Figure 1: Overview of our theoretical results, aligning transformers with the established $k$-WL hierarchy. Forward arrows point to more powerful algorithms or neural architectures. $A \sqsubset B$ ($A \sqsubseteq B$, $A \equiv B$) means that algorithm $A$ is strictly more powerful than (at least as powerful as, equally powerful as) $B$. The relations between the boxes in the lower row stem from Cai+1992 and Mor+2022b.
Figure 2: Visual explanation of the "opposing forces" in \ref{lemma:sufficiently_indicator}. In (a) before softmax: We consider three numbers $x_1$, $x_2$ and $x_3$, where $x_2$ and $x_3$ are less than $\delta$ apart. In (b) after softmax: An increase in $b$ (blue) pushes the maximum value $x_3$ away from $x_1$ and $x_2$. However, the approximation with $\delta$ acts stronger (red). As a result, $x_1$ gets pushed closer to $0$, but $x_2$ and $x_3$ get pushed closer together. In (c) after softmax: Further increasing $b$ makes $x_1$ converge to $0$, but the approximation with $\delta$ pushes $x_2$ and $x_3$ closer together, and the softmax maps both values approximately to $\frac{1}{2}$ (here depicted with the same dot). Hence, with a sufficiently close approximation, we can approximate the weighted indicator matrix $\Tilde{\mathbf{X}}$.
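
The effect described in the Figure 2 caption can be reproduced numerically. The sketch below assumes that $b$ acts as a multiplicative scale on the pre-softmax values and that $\delta$ shrinks fast enough for $b \cdot \delta$ to go to zero; neither assumption is stated in the caption itself, and the function name `scaled_softmax` and the chosen numbers are purely illustrative.

```python
import math


def scaled_softmax(xs, b):
    """Softmax of b * x for each x in xs (numerically stabilized).

    Assumption: b is a multiplicative scale on the pre-softmax values.
    """
    scaled = [b * x for x in xs]
    m = max(scaled)  # subtract the maximum to avoid overflow in exp
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


x1, x3 = 0.0, 1.0
# Increase b while tightening the gap delta between x2 and x3.
for b, delta in [(1.0, 0.5), (10.0, 0.01), (100.0, 0.0001)]:
    x2 = x3 - delta
    probs = scaled_softmax([x1, x2, x3], b)
    print(f"b={b:>5}, delta={delta}: {[round(p, 4) for p in probs]}")
```

As $b$ grows, the weight on $x_1$ vanishes, while the weights on $x_2$ and $x_3$ both approach $\frac{1}{2}$ because their scaled gap $b \cdot \delta$ stays small. This is the sense in which the two "forces" in the caption combine to yield an approximate weighted indicator.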

Theorems & Definitions (54)

Definition 1: Adjacency-identifying
Theorem 2
Definition 3: Node-identifying
Theorem 4
Theorem 5
Theorem 6
Theorem 7
Theorem 8
Lemma 9
Proof
...and 44 more
