k-Maximum Inner Product Attention for Graph Transformers and the Expressive Power of GraphGPS

Jonas De Schouwer, Haitz Sáez de Ocáriz Borde, Xiaowen Dong

Abstract

Graph transformers have shown promise in overcoming limitations of traditional graph neural networks, such as oversquashing and difficulties in modelling long-range dependencies. However, their application to large-scale graphs is hindered by the quadratic memory and computational complexity of the all-to-all attention mechanism. Although alternatives such as linearized attention and restricted attention patterns have been proposed, these often degrade performance or limit expressive power. To better balance efficiency and effectiveness, we introduce k-Maximum Inner Product (k-MIP) attention for graph transformers. k-MIP attention selects the most relevant key nodes per query via a top-k operation, yielding a sparse yet flexible attention pattern. Combined with an attention score computation based on symbolic matrices, this results in linear memory complexity and practical speedups of up to an order of magnitude compared to all-to-all attention, enabling the processing of graphs with over 500k nodes on a single A100 GPU. We provide a theoretical analysis of expressive power, showing that k-MIP attention does not compromise the expressiveness of graph transformers: specifically, we prove that k-MIP transformers can approximate any full-attention transformer to arbitrary precision. In addition, we analyze the expressive power of the GraphGPS framework, in which we integrate our attention mechanism, and establish an upper bound on its graph distinguishing capability in terms of the S-SEG-WL test. Finally, we validate our approach on the Long Range Graph Benchmark, the City-Networks benchmark, and two custom large-scale inductive point cloud datasets, consistently ranking among the top-performing scalable graph transformers.
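
To make the core idea concrete, the sketch below illustrates k-MIP attention in plain PyTorch: for each query node, only the $k$ keys with the largest inner products are kept, and softmax attention is applied over that restricted set. This is a minimal illustration under assumptions of our own (a single head, dense tensors, and the hypothetical function name `k_mip_attention`), not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def k_mip_attention(Q, K, V, k):
    """Sketch of k-MIP attention for one head (illustrative, not the paper's code).

    Q, K, V: (N, d) query/key/value matrices for the N nodes of a graph;
    k: number of key nodes kept per query (the top-k operation).
    """
    scores = Q @ K.T / K.shape[-1] ** 0.5           # (N, N) scaled inner products
    top_scores, top_idx = scores.topk(k, dim=-1)    # keep the k most relevant keys per query
    attn = F.softmax(top_scores, dim=-1)            # softmax restricted to the selected keys
    V_sel = V[top_idx]                              # (N, k, d) gathered value vectors
    return (attn.unsqueeze(-1) * V_sel).sum(dim=1)  # (N, d) attention output per query node
```

Note that this naive version still materializes the dense $N \times N$ score matrix; the linear memory complexity and speedups reported in the paper additionally rely on the symbolic-matrix formulation of the attention score computation, which is not reproduced in this sketch.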

Paper Structure

This paper contains 83 sections, 6 theorems, 25 equations, 9 figures, 16 tables, and 1 algorithm.

Key Result

Theorem 1

Let $\mathcal{A}$ be an instance of the GraphGPS framework that enhances its node features with $\nu(\bm{A})$ and its edge features with $\mu(\bm{A})$. Then $\mathcal{A}$ can only distinguish graphs that are also distinguishable by the $S$-SEG-WL test, where $S=(f_A,f_R)$ and $f_A,f_R$ are defined so that the set of possible colors is $\mathcal{C} = \mathbb{R}^{d_{PE}} \,\cup\, \{0,1\}^2\times\cdots$.

Figures (9)

  • Figure 1: One layer in the GraphGPS framework. The dashed lines indicate residual connections. Based on rampavsek2022recipe.
  • Figure 2: Comparison of full attention and k-MIP attention (with/without symbolic matrices). Shown is the mean $\pm$ std over 5 runs, measured on a single 40GB A100 GPU. Full attention gave OOM errors in the training setting for $N\geq10^5$. The data behind this figure can be found in Appendix \ref{app:detailed_performance_measurements}.
  • Figure 3: Runtime breakdown of full attention and k-MIP attention in the training setting at $N=10^{4.5}$, measured on a single 40GB A100 GPU. The left and middle bars compare full and k-MIP attention at the same scale; the right panel zooms into k-MIP attention.
  • Figure 4: Tradeoffs between training time and accuracy across different datasets in City-Networks liang2025towards. Shown is the mean $\pm$ std over 4 runs, except for the London dataset, where only one run was performed for the graph transformer models. GPS+BigBird was not evaluated on LA due to long training times. GPS+Transformer, Exphormer, and GPS+BigBird ran out of memory (OOM) on London.
  • Figure 5: Illustration of a single iteration of the Weisfeiler-Lehman test. The node labels (in $\mathcal{C} = \mathbb{N}$) are written inside the nodes. The mapping $\tau$ is written next to the nodes. A minimal code sketch of one such iteration is given after this list.
  • ...and 4 more figures
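
As a complement to the Figure 5 caption above, the following is a minimal sketch of one WL refinement iteration under standard assumptions (integer node colors and an adjacency dictionary); the function name `wl_iteration` is illustrative and does not come from the paper.

```python
def wl_iteration(colors, adjacency):
    """One Weisfeiler-Lehman (1-WL) refinement step (illustrative sketch).

    colors: dict mapping each node to its current integer color;
    adjacency: dict mapping each node to an iterable of its neighbours.
    """
    # Each node's signature pairs its own color with the multiset of its
    # neighbours' colors, represented canonically as a sorted tuple.
    signatures = {v: (colors[v], tuple(sorted(colors[u] for u in adjacency[v])))
                  for v in colors}
    # Injectively relabel the signatures with fresh colors (the mapping tau in Figure 5).
    relabel = {sig: i for i, sig in enumerate(sorted(set(signatures.values())))}
    return {v: relabel[signatures[v]] for v in signatures}
```

Iterating this refinement until the coloring stabilizes yields the classical 1-WL test; the $S$-SEG-WL test referenced in Theorem 1 follows the same refinement pattern, with the aggregation step governed by $S=(f_A,f_R)$.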

Theorems & Definitions (18)

  • Theorem 1
  • Definition 1: General transformer block
  • Definition 2: Class $\mathcal{T}_{\mathcal{A}}^{h,m,r}$
  • Theorem 2: k-MIP Approximation Theorem
  • Definition 3
  • Definition 4
  • Definition 5
  • Definition 6: zhu2023structural
  • Definition 7: zhu2023structural
  • Definition 8
  • ...and 8 more