Polynormer: Polynomial-Expressive Graph Transformer in Linear Time

Chenhui Deng; Zichao Yue; Zhiru Zhang

TL;DR

Polynormer addresses the scalability gap in graph transformers by combining high-degree polynomial expressivity with linear-time computation. It introduces a polynomial-expressive base model and derives permutation-equivariant local and global attention modules from it, assembled in a local-to-global architecture that keeps complexity linear. An $L$-layer model can express a polynomial of degree $2^L$, and experiments show strong performance across 13 datasets, including large-scale graphs, even without nonlinear activations (adding an activation yields gains of up to ~4%). The work demonstrates a practical route to highly expressive, scalable graph transformers for large real-world graphs.

Abstract

Graph transformers (GTs) have emerged as a promising architecture that is theoretically more expressive than message-passing graph neural networks (GNNs). However, typical GT models have at least quadratic complexity and thus cannot scale to large graphs. Although several linear GTs have recently been proposed, they still lag behind their GNN counterparts on several popular graph datasets, which raises a critical concern about their practical expressivity. To balance the trade-off between expressivity and scalability of GTs, we propose Polynormer, a polynomial-expressive GT model with linear complexity. Polynormer is built upon a novel base model that learns a high-degree polynomial on input features. To make the base model permutation equivariant, we integrate it with graph topology and node features separately, resulting in local and global equivariant attention models. Consequently, Polynormer adopts a linear local-to-global attention scheme to learn high-degree equivariant polynomials whose coefficients are controlled by attention scores. Polynormer has been evaluated on $13$ homophilic and heterophilic datasets, including large graphs with millions of nodes. Our extensive experimental results show that Polynormer outperforms state-of-the-art GNN and GT baselines on most datasets, even without the use of nonlinear activation functions.
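To make the "linear complexity" claim concrete, the following is a minimal sketch of one common way to obtain linear-time global attention: replace softmax with a positive feature map and reorder the matrix products. The sigmoid feature map, the function names, and the row normalization are assumptions chosen for illustration; the exact attention used in Polynormer is not reproduced here and may differ.

```python
import math


def feature_map(row):
    # Assumption: a sigmoid feature map, which keeps every entry positive
    # so the normalizers below are never zero.
    return [1.0 / (1.0 + math.exp(-v)) for v in row]


def quadratic_attention(Q, K, V):
    """Reference version: builds all n x n scores, O(n^2) in the node count."""
    phi_q = [feature_map(q) for q in Q]
    phi_k = [feature_map(k) for k in K]
    m = len(V[0])
    out = []
    for q in phi_q:
        scores = [sum(a * b for a, b in zip(q, k)) for k in phi_k]
        z = sum(scores)  # row normalizer
        out.append([sum(s * v[c] for s, v in zip(scores, V)) / z
                    for c in range(m)])
    return out


def linear_attention(Q, K, V):
    """Same result, but summarizes keys/values once: O(n * d * m)."""
    phi_q = [feature_map(q) for q in Q]
    phi_k = [feature_map(k) for k in K]
    d, m = len(phi_k[0]), len(V[0])
    kv = [[0.0] * m for _ in range(d)]  # sum over nodes of phi(k)^T v
    k_sum = [0.0] * d                   # sum over nodes of phi(k)
    for k, v in zip(phi_k, V):
        for a in range(d):
            k_sum[a] += k[a]
            for c in range(m):
                kv[a][c] += k[a] * v[c]
    out = []
    for q in phi_q:
        z = sum(q[a] * k_sum[a] for a in range(d))
        out.append([sum(q[a] * kv[a][c] for a in range(d)) / z
                    for c in range(m)])
    return out
```

Both functions compute the same normalized attention output; the second never materializes the $n \times n$ score matrix, which is what allows this style of attention to scale to graphs with millions of nodes.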

Paper Structure (33 sections, 1 theorem, 19 equations, 6 figures, 9 tables)

Introduction
Background
Methodology
A Polynomial-Expressive Base Model with Attention
Equivariant Attention Models with Polynomial Expressivity
The Polynormer Architecture
Experiments
Performance on Homophilic and Heterophilic Graphs
Performance on Large Graphs
Ablation Analysis on Polynormer Attention Schemes
Visualization
Conclusions
Proof for Theorem 3.3
Analysis on $f$ in Equation \ref{eqn:exp_poly}
Polynomial Expressivity of Prior Graph Models
...and 18 more sections

Key Result

Theorem 3.3

An $L$-layer base model $\mathcal{P}$ is $2^L$-polynomial-expressive.
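The degree doubling behind this result can be illustrated with a small sketch. It assumes a scalar toy layer of the form $x \mapsto (w x)(x + b)$, that is, a product of two functions that are each linear in the previous layer's output. This layer form, the weights `w` and `b`, and all function names are assumptions for illustration, not the paper's base model $\mathcal{P}$. Polynomials in a single input variable are stored as coefficient lists.

```python
def poly_mul(p, q):
    """Multiply two polynomials given as coefficient lists (index = power)."""
    out = [0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out


def degree(p):
    """Highest power with a nonzero coefficient (0 for constants)."""
    for k in range(len(p) - 1, -1, -1):
        if p[k] != 0:
            return k
    return 0


def toy_layer(p, w, b):
    """Assumed toy layer: (w * p) * (p + b), a product of two linear maps of p."""
    scaled = [w * a for a in p]
    shifted = [p[0] + b] + p[1:]
    return poly_mul(scaled, shifted)


def degrees_after_layers(num_layers, w=2, b=1):
    """Track the polynomial degree in the input x after each stacked layer."""
    p = [0, 1]  # the input itself: the polynomial x
    degrees = []
    for _ in range(num_layers):
        p = toy_layer(p, w, b)
        degrees.append(degree(p))
    return degrees


print(degrees_after_layers(4))
```

Each layer multiplies two expressions of the current degree, so the degree doubles per layer and reaches $2^L$ after $L$ layers, without any nonlinear activation function.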

Figures (6)

Figure 1: A toy example on a 3-node graph with scalar node features.
Figure 2: Distinctions between the attention schemes in previous work and Polynormer. (a) Prior GTs use a local-and-global attention scheme, in which local and global attention are applied simultaneously in each layer; (b) Polynormer adopts a local-to-global attention scheme, in which local and global attention modules are applied sequentially. The softmax operator in (a) and the attention normalization in (b) are omitted for brevity.
Figure 3: Ablation studies on the attention modules of Polynormer. "Local Attention" means that only the local attention module is used. "Local-and-Global Attention" denotes that the local and global attention modules are employed in parallel, while "Local-to-Global Attention" represents the proposed Polynormer model, in which the local attention module is followed by the global attention module.
Figure 4: Visualization of the importance of nodes (columns) to each target node (row). Higher heatmap values indicate greater importance. Subfigures (a) and (b) both treat nodes as important if they share the same label as the target node, while (a) adds the constraint that these nodes are at most $5$ hops away from the target node. Subfigure (c) measures node importance based on the corresponding global attention scores in Polynormer.
Figure 5: GPU memory usage and training time of Polynormer on synthetic graphs.
...and 1 more figures
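The difference between the two attention schemes contrasted in the Figure 2 and Figure 3 captions can be sketched as two ways of composing layers. The toy `local_mix` and `global_mix` functions, the summation used to combine the parallel branches, and the 3-node path graph are all assumptions for illustration; they stand in for real attention modules and are not the paper's implementation.

```python
def local_mix(x, neighbors):
    """Toy local module: average each node with its graph neighbors."""
    out = []
    for i, xi in enumerate(x):
        group = [xi] + [x[j] for j in neighbors[i]]
        out.append(sum(group) / len(group))
    return out


def global_mix(x):
    """Toy global module: blend each node with the mean over all nodes."""
    mean = sum(x) / len(x)
    return [0.5 * xi + 0.5 * mean for xi in x]


def local_and_global(x, neighbors, num_layers):
    """Parallel scheme: every layer applies both modules to the same input.

    Assumption: the two branches are combined by summation.
    """
    for _ in range(num_layers):
        loc = local_mix(x, neighbors)
        glob = global_mix(x)
        x = [a + b for a, b in zip(loc, glob)]
    return x


def local_to_global(x, neighbors, num_local, num_global):
    """Sequential scheme: local modules first, then global modules."""
    for _ in range(num_local):
        x = local_mix(x, neighbors)
    for _ in range(num_global):
        x = global_mix(x)
    return x


# Hypothetical 3-node path graph 0 - 1 - 2 with scalar node features.
neighbors = {0: [1], 1: [0, 2], 2: [1]}
features = [1.0, 0.0, 0.0]
parallel = local_and_global(features, neighbors, num_layers=1)
sequential = local_to_global(features, neighbors, num_local=1, num_global=1)
```

In the sequential scheme the global module operates on features that have already been refined by the local module, whereas in the parallel scheme both modules see the same layer input.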

Theorems & Definitions (4)

Definition 3.1
Definition 3.2
Theorem 3.3
proof
