On the Theoretical Expressive Power and the Design Space of Higher-Order Graph Transformers

Cai Zhou, Rose Yu, Yusu Wang

TL;DR

The paper addresses the expressiveness and design of higher-order graph transformers, showing that plain order-$k$ transformers fall short of the $k$-WL hierarchy unless explicit tuple indices are provided, which then enables $k$-WL-level expressiveness at higher computational cost. It develops a practical design space of sparse attention mechanisms (neighbor, local neighbor, and virtual-tuple attention), along with kernelized attention and simplicial variants, to balance efficiency and expressive power. Theoretical results characterize the trade-offs between expressiveness and complexity, while experiments on synthetic and real-world datasets demonstrate the effectiveness of sparsification and structure-aware encodings, including successful scaling to large graphs. The findings offer a unified framework for building scalable, expressive higher-order graph transformers with real-world applicability in molecular graphs, long-range interactions, and graph-level tasks.

Abstract

Graph transformers have recently received significant attention in graph learning, partly due to their ability to capture more global interactions via self-attention. Nevertheless, while higher-order graph neural networks have been reasonably well studied, the exploration of extending graph transformers to higher-order variants is just starting. Both theoretical understanding and empirical results are limited. In this paper, we provide a systematic study of the theoretical expressive power of order-$k$ graph transformers and their sparse variants. We first show that an order-$k$ graph transformer without additional structural information is less expressive than the $k$-Weisfeiler-Lehman ($k$-WL) test despite its high computational cost. We then explore strategies to both sparsify and enhance higher-order graph transformers, aiming to improve both their efficiency and expressiveness. Indeed, sparsification based on neighborhood information can enhance the expressive power, as it provides additional information about input graph structures. In particular, we show that a natural neighborhood-based sparse order-$k$ transformer model is not only computationally efficient but also expressive: it is as expressive as the $k$-WL test. We further study several other sparse graph attention models that are computationally efficient and provide their expressiveness analysis. Finally, we provide experimental results to show the effectiveness of the different sparsification strategies.

Paper Structure

This paper contains 71 sections, 18 theorems, 73 equations, 2 figures, and 13 tables.

Key Result

Theorem 3.2

Without taking tuple indices as inputs, $\mathcal{A}_k$ is strictly less expressive than $k$-WL.
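
For readers unfamiliar with the test that the theorem uses as a yardstick, the following is a minimal sketch of the $k=1$ case, i.e. classic 1-WL color refinement. It is not taken from the paper: the function name `wl1_distinguishes`, the adjacency-dict input format, and the example graphs are all assumptions made for illustration, and the example graphs are not the pair shown in the paper's Figure 2.

```python
from collections import Counter

def wl1_distinguishes(adj_a, adj_b):
    """Return True if 1-WL color refinement separates the two graphs.

    Graphs are adjacency dicts {node: [neighbors]} (illustrative format).
    Both graphs are refined together so that they share one color palette.
    """
    graphs = {"a": adj_a, "b": adj_b}
    # Every node starts with the same color (no node features assumed).
    colors = {(g, v): 0 for g, adj in graphs.items() for v in adj}
    # Refinement stabilizes after at most (number of nodes) rounds.
    for _ in range(len(colors)):
        # Signature = own color plus the sorted multiset of neighbor colors.
        sig = {
            (g, v): (colors[(g, v)],
                     tuple(sorted(colors[(g, u)] for u in graphs[g][v])))
            for (g, v) in colors
        }
        # Map each distinct signature to a fresh integer color.
        palette = {s: i for i, s in enumerate(sorted(set(sig.values())))}
        colors = {node: palette[s] for node, s in sig.items()}

    def histogram(g):
        return Counter(c for (h, _), c in colors.items() if h == g)

    return histogram("a") != histogram("b")

# Hypothetical example graphs.
path3 = {0: [1], 1: [0, 2], 2: [1]}
triangle = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
hexagon = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
two_triangles = {0: [1, 2], 1: [0, 2], 2: [0, 1],
                 3: [4, 5], 4: [3, 5], 5: [3, 4]}

print(wl1_distinguishes(path3, triangle))         # True
print(wl1_distinguishes(hexagon, two_triangles))  # False
```

The second call shows the well-known blind spot of 1-WL: both graphs are 2-regular, so every node keeps the same color in every round. Higher-order $k$-WL refines colors of $k$-tuples instead of single nodes, which is the hierarchy against which $\mathcal{A}_k$ is measured.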

Figures (2)

  • Figure 1: Variants of $k$-th order self-attention, i.e. $\mathcal{A}_k$ and its sparse forms. The figure uses order $k=2$ and $n=3$ nodes. For simplicity, we only show the attention of the query token ${\bm{i}}=(1,2)$; all $n^k$ (real) tuples are computed with the same rule. The dashed lines are only for aesthetic illustration. (a) Global attention (plain $\mathcal{A}_k$): the query token computes attention with all $n^k$ tuples. (b) Neighbor attention: the query token computes attention with its $k$-neighbors, where $k$-neighbors are defined as in $k$-WL. (c) Local neighbor attention: the query token computes attention with only its local neighbors, where local neighbors are defined as in WLgoSparse. (d) Virtual tuple attention: the query token only computes attention with the virtual tuples (we only display one for simplicity), while each virtual tuple computes attention with all other real tuples.
  • Figure 2: A pair of non-isomorphic graphs that can be distinguished by $1$-WL and $2$-WL, but cannot be distinguished by $\mathcal{A}_1$ and $\mathcal{A}_2$. For $k\geq 3$, both $k$-WL and $\mathcal{A}_k$ can distinguish them.
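
The key sets described in the Figure 1 caption can be made concrete with a small sketch that enumerates, for one query tuple, which tuples it would attend to under global, neighbor, and local neighbor attention. This is an illustration under stated assumptions, not the authors' implementation: the helper names are hypothetical, the 3-node path graph is made up (the caption does not specify the edges), nodes are 0-based so the caption's query $(1,2)$ becomes `(0, 1)`, and the local-neighbor rule (replace a coordinate only by a graph neighbor of the node it holds) is our reading of the definition the caption refers to. Virtual tuple attention is not covered.

```python
from itertools import product

def all_tuples(n, k):
    """All n**k ordered k-tuples over nodes 0..n-1 (keys of global attention)."""
    return list(product(range(n), repeat=k))

def k_neighbors(t, n):
    """k-WL-style neighbors: replace one coordinate of t by any node.

    Returned with multiplicity, so t itself appears once per coordinate.
    """
    return [t[:j] + (u,) + t[j + 1:] for j in range(len(t)) for u in range(n)]

def local_neighbors(t, adj):
    """Assumed local-neighbor rule: replace coordinate j only by a node
    adjacent to t[j] in the input graph."""
    return [t[:j] + (u,) + t[j + 1:] for j in range(len(t)) for u in adj[t[j]]]

n, k = 3, 2
path = {0: [1], 1: [0, 2], 2: [1]}  # hypothetical graph: 0 - 1 - 2
query = (0, 1)                      # the caption's (1, 2) in 0-based ids

global_keys = all_tuples(n, k)
neighbor_keys = k_neighbors(query, n)
local_keys = local_neighbors(query, path)

print(len(global_keys), len(neighbor_keys), len(local_keys))  # 9 6 3
print(local_keys)  # [(1, 1), (0, 0), (0, 2)]
```

Summed over all $n^k$ queries, global attention scores $n^{2k}$ query-key pairs, while neighbor attention scores $k \cdot n^{k+1}$ (here 81 versus 54). The local variant depends on the graph's degrees, which is why it both saves computation and injects structural information.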

Theorems & Definitions (31)

  • Definition 3.1: Order $k_1,k_2$-Transformer Layer
  • Theorem 3.2
  • Theorem 3.3
  • Theorem 4.1
  • Theorem 4.2
  • Theorem 4.3
  • Proposition 4.4
  • Theorem B.1
  • proof
  • Corollary B.2
  • ...and 21 more