A Scalable and Effective Alternative to Graph Transformers

Kaan Sancak, Zhigang Hua, Jin Fang, Yan Xie, Andrey Malevich, Bo Long, Muhammed Fatih Balin, Ümit V. Çatalyürek

TL;DR

This work addresses the inefficiency of dense attention in Graph Transformers when scaling to large graphs. It introduces GECO, a compact layer that combines a Local Propagation Block (LCB) and a Global Context Block (GCB) to capture local and global dependencies in quasilinear time, with a per-layer training cost of $\mathcal{O}(KN\log N + M)$. GECO shows strong predictive quality on small graphs and scales to large graphs, delivering up to $169\times$ speedup over optimized attention and up to $4.5\%$ accuracy gains relative to state-of-the-art baselines. The approach enables practical large-scale graph learning without partitioning or excessive memory.

Abstract

Graph Neural Networks (GNNs) have shown impressive performance in graph representation learning, but they face challenges in capturing long-range dependencies due to their limited expressive power. To address this, Graph Transformers (GTs) were introduced, utilizing a self-attention mechanism to model pairwise node relationships. Despite their advantages, GTs suffer from quadratic complexity w.r.t. the number of nodes in the graph, hindering their applicability to large graphs. In this work, we present Graph-Enhanced Contextual Operator (GECO), a scalable and effective alternative to GTs that leverages neighborhood propagation and global convolutions to capture local and global dependencies in quasilinear time. Our study on synthetic datasets reveals that GECO reaches a 169x speedup over optimized attention on a graph with 2M nodes. Further evaluations on a diverse range of benchmarks show that GECO scales to large graphs where traditional GTs often face memory and time limitations. Notably, GECO consistently achieves comparable or superior quality compared to baselines, improving the SOTA by up to 4.5%, and offers a scalable and effective solution for large-scale graph learning.
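
To make the gap between quadratic and quasilinear cost concrete, the following back-of-the-envelope sketch compares idealized operation counts. It is an illustration only: the functions `attention_ops` and `quasilinear_ops` are hypothetical names, and the counts ignore constants, the hidden dimension, and the edge term $M$, so they are not expected to reproduce the measured 169x figure.

```python
import math


def attention_ops(n):
    """Idealized cost of dense pairwise attention over n nodes: n^2."""
    return n * n


def quasilinear_ops(n):
    """Idealized cost of a quasilinear operator over n nodes: n * log2(n)."""
    return n * math.log2(n)


# The ratio simplifies to n / log2(n), so it keeps growing with graph size.
for n in (1_000, 100_000, 2_000_000):
    ratio = attention_ops(n) / quasilinear_ops(n)
    print(f"N = {n:>9,}: N^2 / (N log2 N) = {ratio:,.0f}")
```

The point of the sketch is the trend rather than the numbers: the advantage of a quasilinear operator widens as $N$ grows, which is why the benefit is most visible on large graphs.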

Paper Structure (44 sections, 4 theorems, 16 equations, 6 figures, 15 tables, 6 algorithms)

This paper contains 44 sections, 4 theorems, 16 equations, 6 figures, 15 tables, 6 algorithms.

Key Result

Proposition 3.1

LCB can be computed in $\mathcal{O}(N + M)$ time, i.e. linear in the size of the graph, using a Sparse Matrix-Matrix (SpMM) multiplication between $A$ and $X^{(l)}$, where $N$ is the number of nodes and $M = |E|$.
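
A minimal sketch of why this holds, assuming an unweighted adjacency stored in CSR form and plain sum aggregation followed by concatenation with the original features. The helper names (`spmm_csr`, `local_propagation`) are hypothetical, and any normalization or learnable weights used in the actual LCB are not modeled here.

```python
def spmm_csr(indptr, indices, X):
    """Multiply a CSR adjacency (unit edge weights) by a dense matrix X.

    Every row is visited once and every stored edge once per feature
    column, so for a fixed feature width the work is O(N + M).
    """
    n_rows = len(indptr) - 1
    width = len(X[0])
    out = [[0.0] * width for _ in range(n_rows)]
    for v in range(n_rows):
        for e in range(indptr[v], indptr[v + 1]):
            u = indices[e]  # neighbor of node v
            for j in range(width):
                out[v][j] += X[u][j]
    return out


def local_propagation(indptr, indices, X):
    """Aggregate neighbor features and concatenate them with the originals."""
    aggregated = spmm_csr(indptr, indices, X)
    return [x_row + a_row for x_row, a_row in zip(X, aggregated)]


# Path graph 0 - 1 - 2 as a symmetric adjacency (4 directed edges).
indptr = [0, 1, 3, 4]
indices = [1, 0, 2, 1]
X = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]

H = local_propagation(indptr, indices, X)
```

Here `H` has twice the feature width of `X`: the first half of each row is the node's own embedding and the second half is the sum over its neighbors. In practice a sparse library routine would replace the explicit loops, but the access pattern is the same.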

Figures (6)

  • Figure 1: Our architecture comprises a Positional Encoding (PE) block and Graph-Enhanced Contextual Operator (GECO) layers. PE adds positional encodings as a preprocessing step, and each GECO layer is followed by an FFN. A GECO layer contains a Local Propagation Block (LCB), which aggregates neighborhood embeddings and concatenates them with the originals to capture local dependencies, and a Global Context Block (GCB), which efficiently captures global dependencies via global convolutions.
  • Figure 2: Relative speedup of GECO w.r.t. FlashAttention (Dao et al., 2022), which scales as $\mathcal{O}(N/\log N)$.
  • Figure 3: A comparison between GraphGPS and GECO, where the layers with learnable weights are highlighted in color.
  • Figure 4: Illustration of JK-Nets with 3 layers. Note that the Final Layer can be implemented using different layers and does not have to match the intermediate layers. Although the original work (Xu et al., 2018) did not introduce a dense skip connection from the original inputs to the Final Layer, we have included it here for consistency in notation.
  • Figure : Forward pass of GCB Operator
  • ...and 1 more figure
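
The global convolutions mentioned in the Figure 1 caption can run in quasilinear time when evaluated with the Fast Fourier Transform. The sketch below assumes that an FFT-based evaluation is used, which is the standard approach for long convolutions. It shows only the convolution over a single scalar sequence; how GCB parameterizes its filters, orders the nodes, or combines channels is not covered here. The function names are hypothetical.

```python
import cmath


def fft(a, invert=False):
    """Recursive radix-2 FFT; len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def fft_conv(signal, kernel):
    """Linear convolution in O(N log N) using zero-padded FFTs."""
    out_len = len(signal) + len(kernel) - 1
    size = 1
    while size < out_len:
        size *= 2
    fs = fft(list(signal) + [0.0] * (size - len(signal)))
    fk = fft(list(kernel) + [0.0] * (size - len(kernel)))
    product = [a * b for a, b in zip(fs, fk)]
    full = fft(product, invert=True)
    # The inverse transform above is unnormalized, so divide by size.
    return [(v / size).real for v in full[:out_len]]


def direct_conv(signal, kernel):
    """Reference O(N^2) convolution used to check the FFT result."""
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        for j, k in enumerate(kernel):
            out[i + j] += s * k
    return out


signal = [1.0, 2.0, 3.0, 4.0]
kernel = [0.5, -1.0, 0.25, 2.0]
fast = fft_conv(signal, kernel)
slow = direct_conv(signal, kernel)
```

Both functions return the same values up to floating-point error, but `fft_conv` lets a kernel as long as the input mix information across every position without the quadratic cost of `direct_conv`.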

Theorems & Definitions (5)

  • Proposition 3.1
  • Proposition 3.2
  • Proposition 3.3
  • Definition D.1: Janossy Pooling (Murphy et al., 2018)
  • Proposition D.2: $\pi$-SGD Convergence