Aligning Transformers with Weisfeiler-Leman
Luis Müller, Christopher Morris
TL;DR
This work tackles the expressivity gap in graph learning by aligning pure transformers with the $k$-dimensional Weisfeiler--Leman ($k$-WL) hierarchy, aiming to achieve higher-order discrimination without prohibitive computational costs. It introduces a theory-driven transformer framework, including the $1$-GT and a scalable $k$-GT, alongside the $(k,s)$-GT, and demonstrates that with adjacency-identifying node encodings such as Laplacian PEs (LPE) and SPE, pure transformers can emulate $k$-WL dynamics. Practically, the authors validate their approach with large-scale pre-training on PCQM4Mv2 and fine-tuning on small-scale molecular datasets, showing competitive predictive performance and strong transfer to downstream tasks, and they demonstrate expressivity gains on targeted benchmarks such as BREC. Order transfer further makes higher-order expressivity available for downstream tasks while reusing lower-order pre-trained weights, which makes higher-order transformers feasible in practice. Overall, the work provides a principled path to more expressive, scalable pure-transformer graph models, with measurable gains on real-world datasets.
Abstract
Graph neural network architectures aligned with the $k$-dimensional Weisfeiler--Leman ($k$-WL) hierarchy offer theoretically well-understood expressive power. However, these architectures often fail to deliver state-of-the-art predictive performance on real-world graphs, limiting their practical utility. While recent works aligning graph transformer architectures with the $k$-WL hierarchy have shown promising empirical results, employing transformers for higher orders of $k$ remains challenging due to a prohibitive runtime and memory complexity of self-attention as well as impractical architectural assumptions, such as an infeasible number of attention heads. Here, we advance the alignment of transformers with the $k$-WL hierarchy, showing stronger expressivity results for each $k$, making them more feasible in practice. In addition, we develop a theoretical framework that allows the study of established positional encodings such as Laplacian PEs and SPE. We evaluate our transformers on the large-scale PCQM4Mv2 dataset, showing competitive predictive performance with the state-of-the-art and demonstrating strong downstream performance when fine-tuning them on small-scale molecular datasets. Our code is available at https://github.com/luis-mueller/wl-transformers.
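For readers unfamiliar with the hierarchy referenced above, the following is a minimal sketch of the base case, $1$-WL color refinement, written in plain Python. It is an illustration of the classical algorithm only, not the authors' transformer or code from their repository; the function names `wl_colors` and `wl_indistinguishable` and the adjacency-dictionary graph format are assumptions made for this example.

```python
from collections import Counter


def wl_colors(adj):
    """Run 1-WL color refinement on a graph given as {node: [neighbors]}.

    Hypothetical helper for illustration; returns a stable coloring {node: int}.
    """
    colors = {v: 0 for v in adj}  # start from a uniform coloring
    for _ in range(len(adj)):  # at most n rounds are needed to stabilize
        # A node's signature is its own color plus the sorted multiset
        # of its neighbors' colors.
        sigs = {
            v: (colors[v], tuple(sorted(colors[u] for u in adj[v])))
            for v in adj
        }
        # Relabel distinct signatures with small integers.
        palette = {s: i for i, s in enumerate(sorted(set(sigs.values())))}
        new_colors = {v: palette[sigs[v]] for v in adj}
        # Each round only refines the partition, so an unchanged number of
        # color classes means the coloring is stable.
        stable = len(set(new_colors.values())) == len(set(colors.values()))
        colors = new_colors
        if stable:
            break
    return colors


def wl_indistinguishable(adj_a, adj_b):
    """Return True if 1-WL assigns both graphs the same color histogram.

    Refinement runs on the disjoint union so that color ids are comparable.
    """
    union = {("a", v): [("a", u) for u in nbrs] for v, nbrs in adj_a.items()}
    union.update(
        {("b", v): [("b", u) for u in nbrs] for v, nbrs in adj_b.items()}
    )
    colors = wl_colors(union)
    hist_a = Counter(c for (tag, _), c in colors.items() if tag == "a")
    hist_b = Counter(c for (tag, _), c in colors.items() if tag == "b")
    return hist_a == hist_b


# A 6-cycle and two disjoint triangles: both are 2-regular on 6 nodes.
cycle6 = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
two_triangles = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [4, 5], 4: [3, 5], 5: [3, 4]}

print(wl_indistinguishable(cycle6, two_triangles))  # True
```

The 6-cycle and the pair of triangles are not isomorphic, yet $1$-WL cannot separate them because every node has the same degree in both graphs. Pairs like this are what higher orders of $k$ are meant to distinguish.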
