Gaussian Process Limit Reveals Structural Benefits of Graph Transformers

Nil Ayday, Lingchu Yang, Debarghya Ghoshdastidar

Abstract

Graph transformers are the state of the art for learning from graph-structured data and are empirically known to avoid several pitfalls of message-passing architectures. However, there is limited theoretical analysis of why these models perform well in practice. In this work, we prove that attention-based architectures have structural benefits over graph convolutional networks in the context of node-level prediction tasks. Specifically, we study the neural network Gaussian process limits of graph transformers (GAT, Graphormer, Specformer) with infinite width and infinitely many heads, and derive the node-level and edge-level kernels across the layers. Our results characterise how the node features and the graph structure propagate through the graph attention layers. As a specific example, we prove that graph transformers structurally preserve community information and maintain discriminative node representations even in deep layers, thereby preventing oversmoothing. We provide empirical evidence on synthetic and real-world graphs that validates our theoretical insights, for example that integrating informative priors and positional encodings can improve the performance of deep graph transformers.

Paper Structure (61 sections, 36 theorems, 191 equations, 5 figures, 7 tables)

Key Result

Corollary 1

For the GCN kernel presented in Corollary \ref{cor:GCN_kernel_SBM} (and in Table \ref{tab:sbm_summary}), the normalized kernel converges to the all-ones matrix: $\lim_{\ell \to \infty} \frac{K^{(\ell)}}{\frac{1}{n}\operatorname{tr}(K^{(\ell)})} = \mathbf{1}\mathbf{1}^\top$. Consequently, $\operatorname{rank}\left(\lim_{\ell \to \infty} \frac{K^{(\ell)}}{\frac{1}{n}\operatorname{tr}(K^{(\ell)})}\right) = 1$, indicating that GCNs suffer from complete oversmoothing.
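
The rank collapse in Corollary 1 can be illustrated numerically. The sketch below is not the paper's kernel: it assumes a linear GCN-GP recursion $K^{(\ell+1)} = S K^{(\ell)} S^\top$ with a row-normalized adjacency $S = D^{-1}(A + I)$ on a sampled two-community SBM, and an identity input kernel. The function names and the SBM parameters are hypothetical choices made for illustration only.

```python
import numpy as np


def sbm_adjacency(n, p, q, rng):
    """Sample a symmetric two-community SBM adjacency matrix with self-loops.

    Assumes n is even; the first n/2 nodes form one community.
    """
    labels = np.repeat([0, 1], n // 2)
    same = labels[:, None] == labels[None, :]
    probs = np.where(same, p, q)
    upper = np.triu(rng.random((n, n)) < probs, k=1)
    adj = (upper | upper.T).astype(float)
    np.fill_diagonal(adj, 1.0)  # self-loops, so every row sum is positive
    return adj


def normalized_kernel(kernel):
    """Divide the kernel by its average diagonal entry, tr(K) / n."""
    n = kernel.shape[0]
    return kernel / (np.trace(kernel) / n)


def linear_gcn_kernels(adj, k0, depth):
    """Iterate K <- S K S^T (assumed linear GCN-GP recursion), S = D^{-1} A."""
    s = adj / adj.sum(axis=1, keepdims=True)
    k = k0
    kernels = []
    for _ in range(depth):
        # Rescaling at every layer only guards against underflow; the
        # normalized kernel is invariant to positive rescaling.
        k = normalized_kernel(s @ k @ s.T)
        kernels.append(k)
    return kernels


rng = np.random.default_rng(0)
adj = sbm_adjacency(40, 0.8, 0.2, rng)
kernels = linear_gcn_kernels(adj, np.eye(40), depth=200)

rank_first = np.linalg.matrix_rank(kernels[0], tol=1e-8)
rank_last = np.linalg.matrix_rank(kernels[-1], tol=1e-8)
gap_to_ones = np.abs(kernels[-1] - 1.0).max()
print(rank_first, rank_last, gap_to_ones)
```

Under these assumptions, the deep normalized kernel is numerically indistinguishable from the all-ones matrix, so every pair of nodes looks identical to the Gaussian process regardless of community membership.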

Figures (5)

  • Figure 1: Histogram of the output of GAT and Graphormer for different numbers of heads. The output distribution converges to a Gaussian (red line), fitted with the mean and variance of the empirical distribution, when both the width and the number of heads are large. Plots for Specformer are in Appendix \ref{app:experimental details}.
  • Figure 2: Oversmoothing behaviour and the impact of positional encodings. Test accuracy of Graphormer-GP on Chameleon (left) and SBM with random features (right) as a function of the number of layers. While the accuracy of graph models typically degrades as the number of layers increases, Graphormer-GP exhibits increasing accuracy with depth when utilizing informative positional encodings, such as Laplacian Eigenvectors (blue) and Spectral Reconstruction (orange). This empirical trend aligns with the theoretical results in Remark \ref{rem:Graphormer_convergence}.
  • Figure 3: Oversmoothing behaviour across various GNN-GP architectures. Test accuracy is shown as a function of depth for the original models (left) and models augmented with Laplacian Eigenvectors (right). In the original configurations, GCN-GP (blue) consistently suffers from a performance drop-off as the number of layers increases. In contrast, GAT-GP (orange) demonstrates resilience to depth, maintaining stable performance as argued in Corollary \ref{cor:GAT_no_oversmoothing}. The behaviour of Specformer-GP and Graphormer-GP depends on the dataset. Notably, upon incorporating Laplacian Eigenvectors (right), the tendency to oversmooth is significantly mitigated across all architectures.
  • Figure 4: Histogram of the eigenvalue encoding of Specformer for different numbers of heads. The output distribution converges to a Gaussian (red line), fitted with the mean and variance of the empirical distribution, when both the width and the number of heads are large.
  • Figure 5: Oversmoothing behaviour and the impact of positional encodings. Test accuracy of Graphormer-GP on Pubmed (left) and CSBM (right) as a function of the number of layers.
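
The Gaussian convergence shown in the histogram figures can be mimicked in a much simpler setting. The sketch below is a stand-in, not the paper's experiment: it assumes a plain one-hidden-layer ReLU network $f(x) = \frac{1}{\sqrt{m}} \sum_{j=1}^{m} v_j \, \mathrm{relu}(w_j^\top x)$ with standard normal weights, rather than GAT, Graphormer, or Specformer, and it uses excess kurtosis (zero for a Gaussian) in place of a fitted histogram. All names, the input, and the sample sizes are hypothetical.

```python
import numpy as np


def sample_outputs(x, width, n_samples, rng):
    """Evaluate f(x) under n_samples independent random initializations."""
    w = rng.standard_normal((n_samples, width, x.shape[0]))
    v = rng.standard_normal((n_samples, width))
    hidden = np.maximum(w @ x, 0.0)  # ReLU features, shape (n_samples, width)
    return (v * hidden).sum(axis=1) / np.sqrt(width)


def excess_kurtosis(samples):
    """Fourth standardized moment minus 3; close to 0 for Gaussian samples."""
    centered = samples - samples.mean()
    return float(np.mean(centered**4) / np.mean(centered**2) ** 2 - 3.0)


rng = np.random.default_rng(0)
x = np.ones(4) / 2.0  # fixed unit-norm input

narrow = excess_kurtosis(sample_outputs(x, 1, 4000, rng))
wide = excess_kurtosis(sample_outputs(x, 256, 4000, rng))
print(f"width 1: {narrow:.2f}, width 256: {wide:.2f}")
```

The width-1 output is strongly heavy-tailed, while the width-256 output has excess kurtosis near zero, which is the behaviour expected from the central limit argument behind neural network Gaussian process limits.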

Theorems & Definitions (56)

  • Corollary 1: Rank collapse in GCN-GP indicates oversmoothing
  • Corollary 2: GAT-GP avoids rank collapse indicating the preservation of discriminative community structure
  • Remark 3: Graphormer Convergence
  • Remark 4: Specformer Collapses to GCN
  • Remark 5
  • Theorem 6
  • Corollary 7
  • Corollary 8: Linear GAT kernel
  • Theorem 9
  • Corollary 10: Closed-form Graphormer kernel under linear attention
  • ...and 46 more