SGFormer: Single-Layer Graph Transformers with Approximation-Free Linear Complexity

Qitian Wu, Kai Yang, Hengrui Zhang, David Wipf, Junchi Yan

TL;DR

This work proposes the Simplified Single-layer Graph Transformer (SGFormer), whose main component is a single-layer global attention that scales linearly w.r.t. graph size and requires no approximation to accommodate all-pair interactions.

Abstract

Learning representations on large graphs is a long-standing challenge due to the inter-dependent nature of the data. Transformers have recently shown promising performance on small graphs thanks to their global attention, which captures all-pair interactions beyond observed structures. Existing approaches tend to inherit the spirit of Transformers in language and vision tasks and embrace complicated architectures by stacking deep attention-based propagation layers. In this paper, we evaluate the necessity of adopting multi-layer attention in Transformers on graphs, which considerably restricts efficiency. Specifically, we analyze a generic hybrid propagation layer, comprising all-pair attention and graph-based propagation, and show that multi-layer propagation can be reduced to one-layer propagation with the same capability for representation learning. This suggests a new technical path for building powerful and efficient Transformers on graphs, particularly through simplifying model architectures without sacrificing expressiveness. As an example of this path, we propose the Simplified Single-layer Graph Transformer (SGFormer), whose main component is a single-layer global attention that scales linearly w.r.t. graph size and requires no approximation to accommodate all-pair interactions. Empirically, SGFormer scales to the web-scale graph ogbn-papers100M, yields orders-of-magnitude inference acceleration over peer Transformers on medium-sized graphs, and remains competitive with limited labeled data.
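
To make the "generic hybrid propagation layer" in the abstract concrete, the following is a minimal NumPy sketch that mixes an all-pair attention term with a graph-based propagation term. It is illustrative only: the mixing weight `alpha`, the softmax attention, the GCN-style normalization, and all function names are assumptions, not the formulation analyzed in the paper.

```python
import numpy as np

def normalized_adjacency(adj):
    """GCN-style symmetric normalization D^{-1/2} (A + I) D^{-1/2} (assumed choice)."""
    a_hat = adj + np.eye(adj.shape[0])      # add self-loops so every degree is >= 1
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    return a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def softmax_rows(x):
    """Numerically stable row-wise softmax."""
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def hybrid_layer(z, adj, alpha=0.5):
    """One hybrid step: (1 - alpha) * all-pair attention + alpha * graph propagation.

    `alpha` is a hypothetical mixing weight introduced for this sketch.
    """
    attn = softmax_rows(z @ z.T / np.sqrt(z.shape[1]))   # dense N x N attention scores
    return (1.0 - alpha) * (attn @ z) + alpha * (normalized_adjacency(adj) @ z)
```

Note that this sketch materializes the dense N x N attention matrix, so it costs O(N^2) per layer; it illustrates the kind of layer being analyzed, not the linear-time attention that SGFormer uses.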

Paper Structure

This paper contains 22 sections, 5 theorems, 26 equations, 5 figures, 9 tables, and 1 algorithm.

Key Result

Theorem 1

For any propagation matrix $\mathbf P^{(k)} = [p_{uv}^{(k)}]_{N\times N}$ and symmetric weight matrix $\mathbf W^{(k)}$, Eqn. eqn-update is a gradient descent step with step size $\frac{1}{2}$ for the optimization problem w.r.t. the quadratic energy $E(\mathbf Z; \mathbf Z^{(k)}, \mathbf P^{(k)}, \dots)$ …, where $d_{u}^{(k)} = \sum_{v=1}^N p_{uv}^{(k)}$ and the weighted vector norm is defined by …
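
The energy in Theorem 1 is truncated in this summary, so the sketch below does not reproduce it. Instead it uses a simplified stand-in energy, $E(\mathbf Z) = \sum_{u,v} p_{uv} \|\mathbf z_u - \mathbf z_v^{(k)}\|^2$ (an assumption that omits the weight matrix $\mathbf W^{(k)}$), to illustrate the general idea: one gradient step of size $\frac{1}{2}$ from $\mathbf Z^{(k)}$ gives $(1 - d_u)\,\mathbf z_u^{(k)} + \sum_v p_{uv} \mathbf z_v^{(k)}$, which reduces to plain propagation $\mathbf P \mathbf Z^{(k)}$ when every $d_u = 1$.

```python
import numpy as np

def energy(z, z_prev, p):
    """Simplified stand-in energy (assumed form, not the paper's):
    E(Z) = sum_{u,v} p_uv * ||z_u - z_prev_v||^2."""
    diff = z[:, None, :] - z_prev[None, :, :]           # N x N x d pairwise differences
    return float((p * (diff ** 2).sum(axis=-1)).sum())

def energy_grad(z, z_prev, p):
    """Gradient of the energy w.r.t. Z: 2 * (d_u * z_u - sum_v p_uv * z_prev_v)."""
    d = p.sum(axis=1, keepdims=True)                    # d_u = sum_v p_uv
    return 2.0 * (d * z - p @ z_prev)

def descent_step(z_prev, p, step=0.5):
    """One gradient descent step on the energy, starting from Z = Z_prev."""
    return z_prev - step * energy_grad(z_prev, z_prev, p)
```

For a row-stochastic `p` (every row sums to 1), `descent_step(z, p)` equals `p @ z`, so the message-passing update and the descent step coincide under these assumptions.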

Figures (5)

  • Figure 1: Illustration of the main theoretical results in Sec. \ref{sec-theory}. (a) The layer-wise updating rule of message passing models (e.g., GNNs and Transformers) is equivalent to a gradient descent step minimizing a regularized energy in graph signal denoising. The energy has two regularization effects, which enforce local and global smoothness, respectively. (b) Common Transformers that stack multiple propagation layers can be seen as a cascade of descent steps on layer-dependent energies (since the attention scores and feature transformations are specific to each layer). (c) The multi-layer model can be reduced to a one-layer model that achieves the same denoising effect, i.e., yields equivalent output embeddings.
  • Figure 2: (a) Data flow of SGFormer. The input data comprises node features $\mathbf X$ and graph adjacency $\mathbf A$. SGFormer consists of a single-layer global attention and a GNN. The model outputs node representations for final prediction. (b) Computation flow of the simple attention function used by SGFormer, which accommodates all-pair influence among $N$ nodes when computing the updated embeddings within $O(N)$ complexity.
  • Figure 3: Scalability test of training time per epoch and GPU memory cost w.r.t. graph size (i.e., number of nodes). NodeFormer reports OOM when the number of nodes exceeds 30K.
  • Figure 4: Comparison of single-layer vs. multi-layer models on 12 experimental datasets. For each dataset, we plot the training time per epoch and testing scores (Accuracy/ROC-AUC) of SGFormer w.r.t. the number of attention layers.
  • Figure 5: Performance comparison of single-layer vs. multi-layer models on 12 experimental datasets. For each dataset, we plot the testing scores (Accuracy/ROC-AUC) of SGFormer, SGFormer w/o self-loop, SGFormer w/ Softmax and NodeFormer w.r.t. the number of attention layers.
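
The $O(N)$ all-pair attention described for Figure 2(b) can be illustrated with a softmax-free attention, where matrix associativity lets $\mathbf K^\top \mathbf V$ be computed first so the $N \times N$ score matrix is never formed. The sketch below is a minimal NumPy illustration under assumptions: the score form $\mathbf I + \mathbf Q \mathbf K^\top / N$, the row normalization, the Frobenius scaling of the inputs, and the function names are chosen for the example and may differ from SGFormer's exact attention.

```python
import numpy as np

def quadratic_attention(q, k, v):
    """Reference version: builds the N x N score matrix, O(N^2 d) time."""
    n = q.shape[0]
    scores = np.eye(n) + (q @ k.T) / n           # self-loop plus all-pair scores
    return (scores @ v) / scores.sum(axis=1, keepdims=True)

def linear_attention(q, k, v):
    """Same output without the N x N matrix, O(N d^2) time."""
    n = q.shape[0]
    kv = k.T @ v                                 # d x d summary of keys and values
    k_sum = k.sum(axis=0)                        # length-d vector, K^T 1
    numer = v + (q @ kv) / n                     # equals scores @ v
    denom = 1.0 + (q @ k_sum) / n                # equals the row sums of scores
    return numer / denom[:, None]

def frobenius_normalize(x):
    """Scale a matrix to unit Frobenius norm (keeps the denominator positive for N > 1)."""
    return x / np.linalg.norm(x)
```

Because no softmax is applied, the reordering is exact rather than an approximation: for inputs `q = frobenius_normalize(q)` and `k = frobenius_normalize(k)`, both functions return the same embeddings up to floating-point error.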

Theorems & Definitions (10)

  • Theorem 1
  • proof
  • Corollary 1
  • proof
  • Proposition 1
  • proof
  • Theorem 2
  • proof
  • Corollary 2
  • proof