A Scalable and Effective Alternative to Graph Transformers

Kaan Sancak; Zhigang Hua; Jin Fang; Yan Xie; Andrey Malevich; Bo Long; Muhammed Fatih Balin; Ümit V. Çatalyürek

TL;DR

This work addresses the inefficiency of dense attention in Graph Transformers when scaling to large graphs. It introduces GECO, a compact layer that combines a Local Propagation Block (LCB) and a Global Context Block (GCB) to capture local and global dependencies in quasilinear time, with a per-layer training cost of $O(KN\log N + M)$. GECO demonstrates strong predictive quality on small graphs and scales to large graphs, delivering up to $169\times$ speedups over optimized attention and up to $4.5\%$ accuracy gains relative to state-of-the-art baselines. The approach enables large-scale graph learning without partitioning or excessive memory use.

Abstract

Graph Neural Networks (GNNs) have shown impressive performance in graph representation learning, but they face challenges in capturing long-range dependencies due to their limited expressive power. To address this, Graph Transformers (GTs) were introduced, utilizing the self-attention mechanism to effectively model pairwise node relationships. Despite their advantages, GTs suffer from quadratic complexity w.r.t. the number of nodes in the graph, hindering their applicability to large graphs. In this work, we present Graph-Enhanced Contextual Operator (GECO), a scalable and effective alternative to GTs that leverages neighborhood propagation and global convolutions to capture local and global dependencies in quasilinear time. Our study on synthetic datasets reveals that GECO reaches a 169x speedup over optimized attention on a graph with 2M nodes. Further evaluations on a diverse range of benchmarks show that GECO scales to large graphs where traditional GTs often face memory and time limitations. Notably, GECO consistently achieves comparable or superior quality compared to baselines, improving the SOTA by up to 4.5%, and offers a scalable and effective solution for large-scale graph learning.
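
To make the "global convolutions in quasilinear time" claim concrete, the following is a minimal sketch of a generic FFT-based long convolution over node features. It is not GECO's GCB: the function names, the fixed node ordering, and the random filter `k` are assumptions for illustration only. It shows why a filter as long as the node sequence can be applied in $O(N\log N)$ rather than $O(N^2)$, and checks the result against a direct quadratic implementation.

```python
import numpy as np

def fft_global_conv(x, k):
    """Causal linear convolution of x with k along axis 0, computed via FFT.

    x: (N, d) node features in some fixed node order (assumed).
    k: (N, d) one length-N filter per feature channel (assumed).
    Cost is O(N log N) per channel.
    """
    n = x.shape[0]
    fft_len = 2 * n  # zero-pad so circular convolution equals linear convolution
    xf = np.fft.rfft(x, n=fft_len, axis=0)
    kf = np.fft.rfft(k, n=fft_len, axis=0)
    return np.fft.irfft(xf * kf, n=fft_len, axis=0)[:n]

def direct_conv(x, k):
    """Reference O(N^2) implementation of the same causal convolution."""
    n = x.shape[0]
    out = np.zeros_like(x)
    for t in range(n):
        for s in range(t + 1):
            out[t] += k[t - s] * x[s]
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((64, 4))
k = rng.standard_normal((64, 4))
y_fft = fft_global_conv(x, k)
y_ref = direct_conv(x, k)
```

Note that the output of such a convolution depends on the order in which nodes are laid out, which is consistent with the paper devoting a section to permutation sensitivity.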

Paper Structure (44 sections, 4 theorems, 16 equations, 6 figures, 15 tables, 6 algorithms)

Introduction
Background and Related Work
Graph Neural Networks (GNNs)
Graph Transformers (GTs)
Attention Alternatives
Proposed Architecture: GECO
Graph Structural/Positional Encodings
Local Propagation Block (LCB)
Global Context Block (GCB)
Surrogate Attention Analysis
Pitfalls of Permutation Sensitivity and Mitigation Strategies
End-to-End Training
Comparison with Prior Work
Experiments
Objective 1: Prediction Quality
...and 29 more sections

Key Result

Proposition 3.1

LCB can be computed in $\mathcal{O}(N + M)$ time, i.e., in time linear in the graph size, using a Sparse Matrix-Matrix (SpMM) multiplication between $X^{(l)}$ and $A$, where $M = |E|$.
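
As an illustration of the proposition, here is a minimal sketch of neighborhood aggregation as an SpMM, followed by concatenation with the original features as described in the Figure 1 caption. The function name `local_propagation`, the unweighted and unnormalized adjacency matrix, and the toy path graph are assumptions, not the authors' implementation. With the feature dimension treated as a constant, the sparse product touches each stored edge once, which is where the linear dependence on $M$ comes from.

```python
import numpy as np
import scipy.sparse as sp

def local_propagation(adj, x):
    """Aggregate neighbor features with one SpMM and concatenate with the input.

    adj: (N, N) sparse adjacency matrix A (assumed unweighted here).
    x:   (N, d) dense node feature matrix X.
    Returns an (N, 2d) matrix [X, A X].
    """
    agg = adj @ x  # SpMM: work proportional to the number of nonzeros in A
    return np.concatenate([x, agg], axis=1)

# Toy undirected path graph 0 - 1 - 2 - 3 (hypothetical example data).
edges = np.array([[0, 1], [1, 2], [2, 3]])
n = 4
rows = np.concatenate([edges[:, 0], edges[:, 1]])
cols = np.concatenate([edges[:, 1], edges[:, 0]])
adj = sp.csr_matrix((np.ones(len(rows)), (rows, cols)), shape=(n, n))

x = np.arange(8, dtype=float).reshape(4, 2)
out = local_propagation(adj, x)
```

Here node 1 has neighbors 0 and 2, so its aggregated features are the sum of rows 0 and 2 of `x`, appended to its own features.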

Figures (6)

Figure 1: Our architecture comprises Positional Encoding (PE) block and Graph-Enhanced Contextual Operators (GECOs) layers. PE adds positional encodings as a preprocessing step and each GECO is followed by an FFN. A GECO layer contains a Local Propagation Block (LCB) aggregating neighborhood embeddings and concatenating with originals to capture local dependencies, and a Global Context Block (GCB) efficiently capturing global dependencies via global convolutions.
Figure 2: Relative speedup of GECO w.r.t. FlashAttention (Dao et al., 2022), characterized by $\mathcal{O}(N/\log N)$.
Figure 3: A comparison between GraphGPS and GECO, where the layers with learnable weights are highlighted in color.
Figure 4: Illustration of JK-Nets with 3 layers. Note that the Final Layer can be implemented using different layers and does not have to match the intermediate layers. Although the original work (Xu et al., 2018) did not introduce a dense skip connection from the original inputs to the Final Layer, we have included it here for consistency in notation.
Figure : Forward pass of GCB Operator
...and 1 more figures

Theorems & Definitions (5)

Proposition 3.1
Proposition 3.2
Proposition 3.3
Definition D.1: Janossy Pooling (Murphy et al., 2018)
Proposition D.2: $\pi$-SGD Convergence
