Gradformer: Graph Transformer with Exponential Decay

Chuang Liu; Zelin Yao; Yibing Zhan; Xueqi Ma; Shirui Pan; Wenbin Hu

TL;DR

Gradformer tackles the insufficiency of standard self-attention in Graph Transformers by injecting a graph-structure-based inductive bias through an exponential decay attention mask, $M=\lambda^{\psi(v_i,v_j)}$, with learnable, head-specific constraints. This design steers attention toward structurally proximal node pairs while preserving long-range interactions, effectively unifying local GNN-like processing with global GT capabilities. Across nine datasets, including OGB-MOLHIV, Gradformer achieves state-of-the-art or competitive performance, maintains accuracy as network depth increases, and demonstrates robustness under low-resource settings. The approach provides a scalable, principled way to encode graph structure into self-attention, offering advantages over prior position-encoding or bias-based methods and enabling more efficient and effective graph modeling.
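
To make the mask formula concrete, the following minimal Python sketch builds $M$ for a four-node path graph. It assumes $\psi(v_i,v_j)$ is the shortest-path hop count, which is only one possible structural index, and it sets unreachable pairs to zero. The helper names `hop_distances` and `decay_mask` and the value $\lambda = 0.5$ are illustrative and not taken from the paper's code.

```python
from collections import deque


def hop_distances(adj):
    """All-pairs shortest-path hop counts via BFS; unreachable pairs stay None."""
    n = len(adj)
    dist = [[None] * n for _ in range(n)]
    for src in range(n):
        dist[src][src] = 0
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if dist[src][v] is None:
                    dist[src][v] = dist[src][u] + 1
                    queue.append(v)
    return dist


def decay_mask(dist, lam):
    """M[i][j] = lam ** psi(v_i, v_j); unreachable pairs get 0.0 (an assumption)."""
    return [[0.0 if d is None else lam ** d for d in row] for row in dist]


# Path graph 0 - 1 - 2 - 3, given as adjacency lists.
adj = [[1], [0, 2], [1, 3], [2]]
mask = decay_mask(hop_distances(adj), lam=0.5)
```

For node 0 the mask row is `[1.0, 0.5, 0.25, 0.125]`: the entry halves with every extra hop but never reaches zero, so distant nodes remain visible to the attention mechanism.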

Abstract

Graph Transformers (GTs) have demonstrated their advantages across a wide range of tasks. However, the self-attention mechanism in GTs overlooks the graph's inductive biases, particularly those related to structure, which are crucial for graph tasks. Although some methods use positional encoding and attention bias to model inductive biases, their effectiveness is still suboptimal analytically. Therefore, this paper presents Gradformer, a method that integrates GT with the intrinsic inductive bias by applying an exponential decay mask to the attention matrix. Specifically, the values in the decay mask matrix diminish exponentially as node proximity within the graph structure decreases. This design enables Gradformer to retain its ability to capture information from distant nodes while focusing on the graph's local details. Furthermore, Gradformer introduces a learnable constraint into the decay mask, allowing different attention heads to learn distinct decay masks. Such a design diversifies the attention heads, enabling a more effective assimilation of diverse structural information within the graph. Extensive experiments on various benchmarks demonstrate that Gradformer consistently outperforms the Graph Neural Network and GT baseline models in various graph classification and regression tasks. Additionally, Gradformer has proven to be an effective method for training deep GT models, maintaining or even enhancing accuracy compared to shallow models as the network deepens, in contrast to the significant accuracy drop observed in other GT models. Code is available at https://github.com/LiuChuang0059/Gradformer.
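
As a rough illustration of how a decay mask keeps distant nodes reachable while favoring nearby ones, the sketch below scales raw attention scores by a mask and then normalizes them. Applying the mask before the softmax, the function name `masked_attention`, and the example scores are all assumptions made for this sketch; the abstract states only that the mask is applied to the attention matrix.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def masked_attention(scores, mask):
    """Scale each raw score by its mask entry, then normalize each row."""
    return [
        softmax([s * m for s, m in zip(score_row, mask_row)])
        for score_row, mask_row in zip(scores, mask)
    ]


# Hypothetical raw scores: node 0 rates all four nodes equally before masking.
scores = [[2.0, 2.0, 2.0, 2.0]]
# Decay mask for hop distances 0..3 with an assumed decay rate of 0.5.
mask = [[1.0, 0.5, 0.25, 0.125]]
weights = masked_attention(scores, mask)[0]
```

With identical raw scores, the resulting weights decrease strictly with distance yet all stay positive, which is the local-focus-with-global-reach behavior the abstract describes.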

Paper Structure (44 sections, 9 equations, 8 figures, 9 tables)

Introduction
Related Work
Graph Transformers.
Prior Knowledge in Graph Transformer.
Methodology
Preliminaries
Notations.
Graph Transformer.
Proposed Method: Gradformer
Architecture
Masking Attention with Exponential Decay.
Mask Decay with Learnable Constraints.
Discussion
Gradformer is a General Form of GNNs and GTs.
Computational Complexity.
...and 29 more sections

Figures (8)

Figure 1: Visualization of attention patterns in different GT models with two graphs from the OGBG-HIV dataset. From left to right: vanilla GT, GT with position encoding (w/ PE), GT with attention bias (w/ Bias), and GT with our proposed decay mask (ours).
Figure 2: Overview of the Gradformer framework and its comparison with existing methods. a) Vanilla: The vanilla self-attention mechanism serves as the baseline. b) With PE: In several works [graph-transformer], the PE vector (i.e., $\mathbf{p}_i$) is concatenated with the input node features, which can be interpreted as introducing a bias in the attention score. c) With Attention Bias: Some methods [graphormer-v1] incorporate an attention bias (i.e., $\phi(i,j)$) into the attention score calculation. This bias often derives from spatial information, such as the shortest path. d) Ours: Our method introduces an exponential decay mask that is multiplied with the attention scores. This mask is derived from the structural information of the graph. Moreover, different attention heads use distinct masks, made possible through learnable parameters. A comprehensive explanation of the symbols is provided in the Methodology section.
Figure 3: The decay mask mechanism. The blue matrix represents the pairwise dot product, where the intensity of the blue color indicates the magnitude of the attention. The red matrix represents the decay mask, where the intensity of the red color indicates the magnitude of the mask. Once the mask is applied (indicated by red cells with diagonal stripes), the attention values in the masked cells become significantly attenuated. With this attention decay masking, the self-attention mechanism becomes more responsive to the graph's structural characteristics.
Figure 4: Left: The value of the decay mask matrix $\mathbf{M}$ varies with the node-wise distance, $\psi$. Note that the end point is a parameter unique to linear decay; exponential decay does not require it. Furthermore, the start point (i.e., $sp$ in Eq. (\ref{eq:mask-3})) is a learnable parameter, as depicted in the right part. Right: The learned start points for different attention heads vary across epochs during training on the ZINC dataset.
Figure 5: Ablation study of graph structure index.
...and 3 more figures
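
The decay curves described in the Figure 4 caption can be sketched as follows. The functional forms are assumptions: exponential decay is taken as $\lambda^{\max(\psi - sp,\,0)}$, and linear decay falls from 1 at the start point to 0 at a hypothetical end point `ep`. In a trained model the start point would be a learnable per-head parameter; fixed values stand in for it here.

```python
def exp_decay(psi, lam, sp):
    """Exponential decay: 1.0 up to the start point sp, then lam ** (psi - sp)."""
    return lam ** max(psi - sp, 0.0)


def linear_decay(psi, sp, ep):
    """Linear decay: 1.0 up to sp, dropping to 0.0 at the end point ep."""
    if psi <= sp:
        return 1.0
    if psi >= ep:
        return 0.0
    return (ep - psi) / (ep - sp)


# Two hypothetical heads with different start points (fixed here, learnable in training).
head_start_points = [0.0, 2.0]
curves = [
    [exp_decay(d, lam=0.5, sp=sp) for d in range(5)]
    for sp in head_start_points
]
```

The head with start point 0 begins decaying at the first hop, while the head with start point 2 treats all nodes within two hops equally. Unlike the linear variant, the exponential curve never reaches zero, so it needs no end point.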
