GraphTARIF: Linear Graph Transformer with Augmented Rank and Improved Focus

Zhaolin Hu, Kun Li, Hehe Fan, Yi Yang

TL;DR

GraphTARIF tackles the expressiveness gap in linear Graph Transformers by addressing two core issues: low-rank attention and high entropy. It introduces a gated local GAT-augmented branch to raise the effective rank of the attention map and a learnable log-power sharpening function to reduce entropy, complemented by a node-wise post-modulation that sharpens representations. Theoretical results connect rank to inter-class separability and show entropy reduction improves discriminability, while experiments across homophilic, heterophilic, and large-scale graphs demonstrate competitive accuracy with linear scalability. The approach yields practical benefits for Web-scale graph tasks, balancing efficiency and expressiveness in graph learning.

Abstract

Linear attention mechanisms have emerged as efficient alternatives to full self-attention in Graph Transformers, offering linear time complexity. However, existing linear attention models often suffer from a significant drop in expressiveness due to low-rank projection structures and overly uniform attention distributions. We theoretically prove that these properties reduce the class separability of node representations, limiting the model's classification ability. To address this, we propose a novel hybrid framework that enhances both the rank and focus of attention. Specifically, we enhance linear attention by attaching a gated local graph network branch to the value matrix, thereby increasing the rank of the resulting attention map. Furthermore, to alleviate the excessive smoothing effect inherent in linear attention, we introduce a learnable log-power function into the attention scores to reduce entropy and sharpen focus. We theoretically show that this function decreases entropy in the attention distribution, enhancing the separability of learned embeddings. Extensive experiments on both homophilic and heterophilic graph benchmarks demonstrate that our method achieves competitive performance while preserving the scalability of linear attention.
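The linear time complexity mentioned above comes from reordering the attention product so that the $n \times n$ attention map is never formed. The following is a minimal sketch of that reordering, not the paper's implementation: the feature map `elu(x) + 1`, the function names, and the toy sizes are all assumptions made for illustration. It checks that the linear-time form matches an explicit quadratic reference.

```python
import math
import random


def feature_map(x):
    """elu(x) + 1: a positive feature map (assumed here, not taken from the paper)."""
    return x + 1.0 if x > 0 else math.exp(x)


def linear_attention(Q, K, V):
    """Row-normalized phi(Q) (phi(K)^T V), never forming the n x n attention map."""
    n, d, d_v = len(Q), len(Q[0]), len(V[0])
    phi_q = [[feature_map(x) for x in row] for row in Q]
    phi_k = [[feature_map(x) for x in row] for row in K]
    # d x d_v summary of keys and values: kv[a][b] = sum_j phi_k[j][a] * V[j][b]
    kv = [[sum(phi_k[j][a] * V[j][b] for j in range(n)) for b in range(d_v)]
          for a in range(d)]
    # Column sums of phi(K), used for the per-row normalizer
    k_sum = [sum(phi_k[j][a] for j in range(n)) for a in range(d)]
    out = []
    for i in range(n):
        denom = sum(phi_q[i][a] * k_sum[a] for a in range(d))
        out.append([sum(phi_q[i][a] * kv[a][b] for a in range(d)) / denom
                    for b in range(d_v)])
    return out


def quadratic_attention(Q, K, V):
    """Reference: build every attention weight explicitly (O(n^2) cost)."""
    n, d_v = len(Q), len(V[0])
    phi_q = [[feature_map(x) for x in row] for row in Q]
    phi_k = [[feature_map(x) for x in row] for row in K]
    out = []
    for i in range(n):
        weights = [sum(a * b for a, b in zip(phi_q[i], phi_k[j])) for j in range(n)]
        total = sum(weights)
        out.append([sum(weights[j] * V[j][b] for j in range(n)) / total
                    for b in range(d_v)])
    return out


def random_matrix(rows, cols, rng):
    """Toy Gaussian matrix standing in for projected node features."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
Q, K, V = (random_matrix(6, 3, rng) for _ in range(3))
max_gap = max(abs(a - b)
              for row_a, row_b in zip(linear_attention(Q, K, V),
                                      quadratic_attention(Q, K, V))
              for a, b in zip(row_a, row_b))
```

Because the implied attention map factorizes as $\phi({\bm{Q}})\phi({\bm{K}})^\top$, its rank cannot exceed the feature dimension $d$, which is the low-rank limitation the abstract refers to.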

Paper Structure

This paper contains 35 sections, 3 theorems, 19 equations, 10 figures, 6 tables.

Key Result

Theorem 1

Let ${\bm{X}} \in \mathbb{R}^{n \times d}$ be the node feature matrix and ${\bm{M}} \in \mathbb{R}^{n \times n}$ an attention matrix applied to transform the embeddings. Suppose that the rows of ${\bm{X}}$ are drawn from a Gaussian mixture model. Then, the expected inter-class variance after applying … where $r$ is the rank of the attention matrix ${\bm{M}}$, and $C$ is a constant that depends on the …
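Since the theorem is stated in terms of the rank $r$ of ${\bm{M}}$, a small numerical illustration of how $r$ behaves may help. The sketch below is a toy under stated assumptions, not the paper's gated GAT branch: it uses a random factorized map as a stand-in for linear attention, a row-normalized ring-graph adjacency as a stand-in for local attention, and a fixed scalar `gate` in place of a learned gate. All function names are hypothetical.

```python
import random


def matrix_rank(M, tol=1e-9):
    """Numerical rank via Gaussian elimination with partial pivoting."""
    A = [row[:] for row in M]
    rows, cols = len(A), len(A[0])
    rank = 0
    for col in range(cols):
        if rank == rows:
            break
        pivot = max(range(rank, rows), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < tol:
            continue  # no usable pivot in this column
        A[rank], A[pivot] = A[pivot], A[rank]
        for r in range(rank + 1, rows):
            factor = A[r][col] / A[rank][col]
            for c in range(col, cols):
                A[r][c] -= factor * A[rank][c]
        rank += 1
    return rank


def low_rank_map(n, d, rng):
    """Factorized n x n map F G^T, whose rank is at most d."""
    F = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
    G = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
    return [[sum(F[i][a] * G[j][a] for a in range(d)) for j in range(n)]
            for i in range(n)]


def ring_adjacency(n):
    """Row-normalized adjacency with self-loops for a ring graph (toy local structure)."""
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in (i - 1, i, i + 1):
            A[i][j % n] = 1.0 / 3.0
    return A


def add_local_branch(M, A_local, gate):
    """Add a gated sparse local term to the global low-rank map."""
    return [[m + gate * a for m, a in zip(m_row, a_row)]
            for m_row, a_row in zip(M, A_local)]


rng = random.Random(0)
n, d = 8, 2
M = low_rank_map(n, d, rng)
M_aug = add_local_branch(M, ring_adjacency(n), gate=0.5)
rank_before, rank_after = matrix_rank(M), matrix_rank(M_aug)
```

In this toy setting `rank_before` is bounded by `d`, while the sparse local term lifts `rank_after` above it. The sketch says nothing about the constants in the theorem, only about why a local branch can raise $r$.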

Figures (10)

  • Figure 1: Comparisons among Graph Transformers (GT) in terms of accuracy, runtime, and GPU memory usage on Minesweeper. The proposed GraphTARIF achieves consistently superior accuracy with shorter runtime, demonstrating both effectiveness and efficiency.
  • Figure 2: (a) Node classification performance on homophilic graphs (WikiCS, CS) and a heterophilic graph (Toloker) shows that softmax attention consistently outperforms linear attention. (b) Visualization of normalized attention matrices for 120 evenly-sampled nodes reveals that linear attention produces low-rank, high-entropy attention distributions across both homophilic and heterophilic graphs.
  • Figure 3: The overall framework of GraphTARIF. It mainly consists of GNN layers and high-rank linear attention. ① is the gated local attention that mitigates the low-rank limitation of linear attention. ② is a learnable log-power function $f(\cdot; p, q)$ that sharpens the attention distribution and reduces entropy, and ③ is a node-wise post-modulation module that produces clearer and more discriminative node representations. Here, ${\bm{Q}}, {\bm{K}}, {\bm{V}}$ denote the standard query, key, and value projections of node features ${\bm{X}}$, ${\bm{A}}$ is the graph adjacency matrix, and $\psi(\cdot)$ denotes a simple linear projection applied for node-wise post-modulation.
  • Figure 4: (a) Linear attention yields high-entropy attention maps with smooth, overly uniform outputs. (b) After applying the Learnable Log-Power Functions, the attention map becomes significantly sharper with lower entropy, allowing nodes to focus on more relevant regions and produce more diverse representations.
  • Figure 5: Training time and GPU memory usage of GraphTARIF on the pokec dataset.
  • ...and 5 more figures
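
The entropy reduction described in the Figure 4 caption can be illustrated on a single attention row. The sketch below is only a stand-in: the exact form of the learnable log-power function $f(\cdot; p, q)$ is not given in this summary, so a plain elementwise power `s ** power` is assumed instead, and the score values are made up for illustration.

```python
import math


def normalize(scores):
    """Turn positive scores into a probability distribution."""
    total = sum(scores)
    return [s / total for s in scores]


def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def sharpen(scores, power):
    """Raise positive scores to `power` (assumed stand-in for the paper's function).

    With power > 1, mass concentrates on the largest scores after normalization.
    """
    return [s ** power for s in scores]


# One row of positive, nearly uniform attention scores (hypothetical values)
scores = [0.9, 1.1, 1.0, 1.3, 0.7]
plain_entropy = entropy(normalize(scores))
sharp_entropy = entropy(normalize(sharpen(scores, power=3.0)))
```

For any non-uniform row of positive scores and any power above 1, the sharpened distribution has lower entropy than the plain one, which matches the qualitative behavior the caption describes.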

Theorems & Definitions (4)

  • Definition 1: Positive Sequence Entropy (PSE) [meng2025polaformer]
  • Theorem 1
  • Theorem 2
  • Theorem 3