SGFormer: Simplifying and Empowering Transformers for Large-Graph Representations

Qitian Wu, Wentao Zhao, Chenxiao Yang, Hengrui Zhang, Fan Nie, Haitian Jiang, Yatao Bian, Junchi Yan

TL;DR

SGFormer shows that learning on very large graphs can be effectively achieved with a single-layer global attention, providing competitive node representations while scaling linearly with the number of nodes. By avoiding positional encodings, pre-processing, and augmented losses, it delivers substantial efficiency gains and enables web-scale training (e.g., ogbn-papers100M) with modest hardware. The work offers theoretical insight by connecting one-layer attention to a denoising objective and proving equivalence to multi-layer attention under certain constructions, explaining why shallow attention can suffice. Overall, SGFormer presents a practical, scalable pathway for Transformer-based graph representations with broad implications for large-graph learning tasks.

Abstract

Learning representations on large graphs is a long-standing challenge because of the interdependence among massive numbers of data points. Transformers, as an emerging class of foundation encoders for graph-structured data, have shown promising performance on small graphs owing to their global attention, which can capture all-pair influence beyond neighboring nodes. Even so, existing approaches tend to inherit the spirit of Transformers in language and vision tasks, and adopt complicated models that stack deep multi-head attention layers. In this paper, we demonstrate that even a single attention layer can deliver surprisingly competitive performance across node property prediction benchmarks where node numbers range from thousand-level to billion-level. This encourages us to rethink the design philosophy for Transformers on large graphs, where global attention is a computational overhead that hinders scalability. We frame the proposed scheme as Simplified Graph Transformers (SGFormer), which is empowered by a simple attention model that can efficiently propagate information among arbitrary nodes in one layer. SGFormer requires no positional encodings, feature or graph pre-processing, or augmented loss. Empirically, SGFormer successfully scales to the web-scale graph ogbn-papers100M and yields up to 141x inference acceleration over SOTA Transformers on medium-sized graphs. Beyond the current results, we believe the proposed methodology opens a new technical path of independent interest for building Transformers on large graphs.
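
To make the linear-scaling claim concrete, here is a minimal NumPy sketch of a softmax-free, one-layer all-pair attention. It is an illustration under stated assumptions, not the paper's confirmed formulation: the function names, the Frobenius-norm scaling of queries and keys, and the identity self-loop are choices made for this example. The point it shows is that multiplying `K.T @ V` first avoids ever building the N x N attention matrix, while giving the same output as the quadratic version.

```python
import numpy as np

def _project(X, W_q, W_k, W_v):
    # Assumption for this sketch: scale Q and K by their Frobenius norms
    # so that every pairwise score q_i . k_j lies in [-1, 1].
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    return Q / np.linalg.norm(Q), K / np.linalg.norm(K), V

def linear_global_attention(X, W_q, W_k, W_v):
    """One-layer all-pair attention in O(N d^2) time and O(N d) memory."""
    N = X.shape[0]
    Q, K, V = _project(X, W_q, W_k, W_v)
    kv = K.T @ V                    # (d, d) summary of all keys and values
    k_sum = K.sum(axis=0)           # (d,) used for the row normalizer
    numer = V + (Q @ kv) / N        # self-loop term plus global aggregation
    denom = 1.0 + (Q @ k_sum) / N   # row sums of the implicit attention matrix
    return numer / denom[:, None]

def quadratic_reference(X, W_q, W_k, W_v):
    """Same computation with the N x N matrix built explicitly (O(N^2))."""
    N = X.shape[0]
    Q, K, V = _project(X, W_q, W_k, W_v)
    A = np.eye(N) + (Q @ K.T) / N   # explicit all-pair attention weights
    return (A @ V) / A.sum(axis=1, keepdims=True)
```

The two functions agree up to floating-point error; only the order of the matrix products differs, which is what moves the cost from quadratic to linear in the number of nodes.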

Paper Structure

This paper contains 20 sections, 2 theorems, 18 equations, 6 figures, and 5 tables.

Key Result

Theorem 1

For any given attention matrix $\mathbf C^{(k)} = [c_{uv}^{(k)}]_{N\times N}$, the attention update (labelled eqn-attn-update in the paper) is equivalent to a gradient descent step with step size $\frac{\tau}{2\lambda}$ on an optimization problem whose cost function is not reproduced in this summary, where $\lambda$ is a weight trading off the local and global smoothness criteria [globallocal-2003].

Figures (6)

  • Figure 1: Illustration of the proposed model SGFormer and its data flow. The input graph data entails node features $\mathbf X$ and graph adjacency $\mathbf A$. For large graphs, we need to use mini-batch sampling that randomly partitions the input graph into mini-batches with smaller sizes. Each mini-batch is composed of the features of the nodes within this mini-batch $\mathbf X_m$ and the local graph adjacency $\mathbf A_m$ (one can also use neighbor sampling as an alternative). The mini-batch data $(\mathbf X_m, \mathbf A_m)$ (for large graphs) or the whole graph data $(\mathbf X, \mathbf A)$ (for small graphs) will be fed into the SGFormer model that is implemented with a one-layer global attention and a GNN network. The model outputs the node representations for final prediction.
  • Figure 2: Scalability test of training time per epoch and GPU memory usage w.r.t. graph size (number of nodes). NodeFormer runs out of memory once the number of nodes exceeds 30K.
  • Figure 3: Testing scores and training time per epoch of SGFormer w.r.t. the number of attention layers. Results on more datasets are deferred to the appendix.
  • Figure 4: Testing performance of NodeFormer, SGFormer w/o self-loop and SGFormer w/ Softmax w.r.t. the number of attention layers. The missing results are caused by out-of-memory.
  • Figure 5: Illustration of the theoretical analysis showing the equivalence between the multi-layer and one-layer attention models. The one-layer attention model can produce the same effect on the smoothness criteria and help to save potential redundancy.
  • ...and 1 more figure
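
The random-partition mini-batching described in the Figure 1 caption can be sketched in plain Python as below. This is an illustrative sketch only: the function name, the edge-list representation of $\mathbf A$, and the `seed` parameter are assumptions made here, and the paper's actual data pipeline may differ (the caption also mentions neighbor sampling as an alternative). Each batch holds the features $\mathbf X_m$ of its nodes and the local adjacency $\mathbf A_m$ restricted to edges with both endpoints in the batch.

```python
import random

def random_partition_batches(features, edges, batch_size, seed=0):
    """Yield (node_ids, X_m, A_m) for a random partition of the nodes.

    features: list of per-node feature vectors (the rows of X)
    edges:    list of (u, v) pairs (the nonzeros of A)
    A_m is an edge list re-indexed to positions within the batch.
    """
    rng = random.Random(seed)
    order = list(range(len(features)))
    rng.shuffle(order)  # random assignment of nodes to batches
    for start in range(0, len(order), batch_size):
        node_ids = order[start:start + batch_size]
        local = {u: i for i, u in enumerate(node_ids)}
        X_m = [features[u] for u in node_ids]
        # Keep only edges whose two endpoints both fall in this batch.
        # (Scans all edges per batch: fine for a sketch, slow at scale.)
        A_m = [(local[u], local[v]) for u, v in edges
               if u in local and v in local]
        yield node_ids, X_m, A_m
```

Edges that cross batch boundaries are dropped by this scheme, so each batch sees only its induced subgraph.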

Theorems & Definitions (4)

  • Theorem 1
  • Theorem 2
  • proof
  • proof