Exploring the Global-to-Local Attention Scheme in Graph Transformers: An Empirical Study

Gang Wu, Zhengwei Wang

TL;DR

Graph Transformers often suffer from over-globalizing tendencies that neglect local neighborhood information. This work proposes G2LFormer, a global-to-local attention model where shallow global attention captures long-range dependencies and deeper local GNNs learn local structure, connected by a NOSAF-based cross-layer fusion to prevent information loss. Using SGFormer as the global backbone and Cluster-GCN or GatedGCN as local backbones, G2LFormer achieves state-of-the-art results on node- and graph-level tasks while maintaining linear complexity $O(N+|\mathcal{E}|)$ and a fusion cost of $O(N d' d'')$, which reduces to $O(N)$ for small $d',d''$. Experiments across diverse datasets demonstrate strong performance and scalability, highlighting the viability of global-to-local architectures for efficient, expressive graph representation learning.

Abstract

Graph Transformers (GTs) show considerable potential in graph representation learning. GT architectures typically combine Graph Neural Networks (GNNs) with global attention mechanisms, placing the GNN either in parallel with the attention mechanism or before it, which yields a local-and-global or local-to-global attention scheme. However, because the global attention mechanism primarily captures long-range dependencies between nodes, these integration schemes may suffer from information loss: the local neighborhood information learned by the GNN can be diluted by the attention mechanism. We therefore propose G2LFormer, which features a novel global-to-local attention scheme. Its shallow layers use attention mechanisms to capture global information, while its deeper layers employ GNN modules to learn local structural information, thereby preventing nodes from ignoring their immediate neighbors. A cross-layer information fusion strategy allows the local layers to retain beneficial information from the global layers and alleviates information loss, with acceptable trade-offs in scalability. To validate the feasibility of the global-to-local attention scheme, we compare G2LFormer with state-of-the-art linear GTs and GNNs on node-level and graph-level tasks. The results indicate that G2LFormer performs strongly while retaining linear complexity.

Paper Structure

This paper contains 20 sections, 13 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Categories of attention schemes in which local layers stand for GNN and global layers stand for the attention mechanism. (a) Local layers and global layers learn node representations in parallel and aggregate their outputs in a specific way; (b) Local-to-global: Local layers precede global layers, with the attention mechanism learning the final node representations; (c) Global-to-local: Global layers precede local layers, with GNN learning the final node representations. Notably, while (a)-(b) are established approaches, scheme (c) remains underexplored.
  • Figure 2: The general framework of global-to-local attention scheme. "Filter" is equivalent to the operation of preserving critical node information in cross-layer information fusion strategy. Both local and global layers are modular components that admit substitution with other backbone models.
  • Figure 3: The process of the cross-layer information fusion strategy adopted by G2LFormer. The "Filter" corresponds to $\mathcal{F}_{f}$, which is identical to the "Filter" depicted in Figure 2.
  • Figure 4: Ablation study: 'with strategy' uses cross-layer information fusion strategy; 'no strategy' omits it.
  • Figure 5: Scalability test of training time and GPU memory usage.
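
The three layer orderings described in Figure 1 can be illustrated with a toy sketch. This is not the authors' implementation: `global_layer` is a placeholder linear-attention-style layer (not SGFormer), `local_layer` is a placeholder mean-aggregation GNN layer (not Cluster-GCN or GatedGCN), and the scalar `gate` is a hypothetical stand-in for the cross-layer fusion "Filter" (not NOSAF). All function names are assumptions made for illustration.

```python
import numpy as np

def global_layer(x):
    """Placeholder global layer: every node reads one shared (d, d) summary."""
    q = x / (np.linalg.norm(x, axis=1, keepdims=True) + 1e-9)
    kv = q.T @ x                      # (d, d) summary of all nodes, O(N d^2)
    return x + (q @ kv) / x.shape[0]  # residual plus global read-out

def local_layer(x, adj):
    """Placeholder local layer: mean over neighbors and the node itself.
    A dense adjacency is used for brevity; a sparse one would be needed
    to get cost proportional to the number of edges."""
    a = adj + np.eye(adj.shape[0])
    return (a @ x) / a.sum(axis=1, keepdims=True)

def parallel(x, adj):
    # Scheme (a): local and global outputs are computed side by side and averaged.
    return 0.5 * (local_layer(x, adj) + global_layer(x))

def local_to_global(x, adj):
    # Scheme (b): attention produces the final node representations.
    return global_layer(local_layer(x, adj))

def global_to_local(x, adj, gate=0.5):
    # Scheme (c): the GNN produces the final node representations.
    g = global_layer(x)
    h = local_layer(g, adj)
    # Hypothetical fusion: re-inject a fixed fraction of the global features.
    return h + gate * g

# Path graph with 4 nodes and 3-dimensional features.
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
x = np.random.default_rng(0).normal(size=(4, 3))

out_a = parallel(x, adj)
out_b = local_to_global(x, adj)
out_c = global_to_local(x, adj)
```

In this sketch the only difference between the schemes is the order of composition and which layer type has the last word on the node representations; the fixed `gate` merely marks where a learned fusion step would sit.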