Exploring the Global-to-Local Attention Scheme in Graph Transformers: An Empirical Study
Gang Wu, Zhengwei Wang
TL;DR
Graph Transformers tend to over-globalize, attending to distant nodes at the expense of local neighborhood information. This work proposes G2LFormer, a global-to-local attention model in which shallow global-attention layers capture long-range dependencies and deeper local GNN layers learn local structure; a NOSAF-based cross-layer fusion connects the two stages to prevent information loss. Using SGFormer as the global backbone and Cluster-GCN or GatedGCN as the local backbone, G2LFormer achieves state-of-the-art results on node- and graph-level tasks while maintaining linear complexity $O(N+|\mathcal{E}|)$. The fusion step costs $O(N d' d'')$, which reduces to $O(N)$ when $d'$ and $d''$ are small. Experiments across diverse datasets demonstrate strong performance and scalability, highlighting the viability of global-to-local architectures for efficient, expressive graph representation learning.
Abstract
Graph Transformers (GTs) show considerable potential in graph representation learning. GT architectures typically integrate Graph Neural Networks (GNNs) with global attention mechanisms, placing the GNN either in parallel with the attention or as a precursor to it, which yields a local-and-global or local-to-global attention scheme, respectively. However, because the global attention mechanism primarily captures long-range dependencies between nodes, these integration schemes may suffer from information loss: the local neighborhood information learned by the GNN can be diluted by the attention mechanism. We therefore propose G2LFormer, which features a novel global-to-local attention scheme: the shallow network layers use attention mechanisms to capture global information, while the deeper layers employ GNN modules to learn local structural information, thereby preventing nodes from ignoring their immediate neighbors. A cross-layer information fusion strategy allows the local layers to retain beneficial information from the global layers and alleviates information loss, with acceptable trade-offs in scalability. To validate the feasibility of the global-to-local attention scheme, we compare G2LFormer with state-of-the-art linear GTs and GNNs on node-level and graph-level tasks. The results indicate that G2LFormer exhibits excellent performance while keeping linear complexity.
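
To make the ordering of the scheme concrete, the following is a minimal, illustrative sketch of a global-to-local forward pass; it is not the G2LFormer implementation. It rests on several assumptions: the global stage is a weight-free, softmax-free linear attention (standing in for SGFormer), the local stage is plain mean aggregation over neighbors (standing in for Cluster-GCN or GatedGCN), and the fusion is a simple per-node sigmoid gate (a hypothetical stand-in, not NOSAF). All function names are made up for illustration.

```python
import math

def matmul(a, b):
    """Dense matrix product on nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def global_linear_attention(x):
    """Global stage (assumed form): X + X (X^T X) / N.

    Computing the d x d summary X^T X first keeps the cost at O(N d^2)
    rather than the O(N^2 d) of full pairwise attention.
    """
    n = len(x)
    x_t = [list(col) for col in zip(*x)]
    summary = matmul(x_t, x)          # d x d summary over all nodes
    attended = matmul(x, summary)     # every node reads the same summary
    return [[xi + ai / n for xi, ai in zip(xr, ar)] for xr, ar in zip(x, attended)]

def local_mean_aggregation(h, edges):
    """Local stage (assumed form): mean over each node and its neighbors, O(|E| d)."""
    n, d = len(h), len(h[0])
    agg = [row[:] for row in h]       # self-loop contribution
    deg = [1] * n
    for u, v in edges:                # edges are treated as undirected
        for k in range(d):
            agg[u][k] += h[v][k]
            agg[v][k] += h[u][k]
        deg[u] += 1
        deg[v] += 1
    return [[val / deg[i] for val in agg[i]] for i in range(n)]

def gated_fusion(local, global_):
    """Hypothetical cross-layer fusion: local + sigmoid(mean(global)) * global."""
    fused = []
    for l_row, g_row in zip(local, global_):
        gate = 1.0 / (1.0 + math.exp(-sum(g_row) / len(g_row)))
        fused.append([l + gate * g for l, g in zip(l_row, g_row)])
    return fused

def g2l_forward(x, edges):
    """Global attention first, local aggregation second, then fuse the two."""
    g = global_linear_attention(x)
    l = local_mean_aggregation(g, edges)
    return gated_fusion(l, g)

# Toy usage: a 4-node path graph with 2-dimensional node features.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
path_edges = [(0, 1), (1, 2), (2, 3)]
out = g2l_forward(features, path_edges)
```

In this sketch the local stage consumes the output of the global stage, and the fusion step re-injects the global representation afterwards, so each stage touches every node or edge a constant number of times for fixed feature dimension.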
