Towards Mechanistic Interpretability of Graph Transformers via Attention Graphs

Batu El, Deepro Choudhury, Pietro Liò, Chaitanya K. Joshi

TL;DR

The paper tackles mechanistic interpretability for GNNs and Graph Transformers by introducing Attention Graphs, a framework that aggregates self-attention matrices across heads and layers to map information flow among input nodes. It formalizes a design space for Graph Transformers along sparsity and parametrization, and proposes aggregation rules that produce a unified Attention Graph capturing multi-hop information flow. Empirical results reveal that unconstrained Graph Transformers can learn information-flow patterns that diverge from the input graph, and that on heterophilous graphs different architectures can achieve similar performance via distinct information-flow strategies. This work lays a foundation for network-science–driven interpretability of graph-based models, providing insights into algorithmic diversity and guiding future analyses on larger architectures and broader scientific tasks.

Abstract

We introduce Attention Graphs, a new tool for mechanistic interpretability of Graph Neural Networks (GNNs) and Graph Transformers based on the mathematical equivalence between message passing in GNNs and the self-attention mechanism in Transformers. Attention Graphs aggregate attention matrices across Transformer layers and heads to describe how information flows among input nodes. Through experiments on homophilous and heterophilous node classification tasks, we analyze Attention Graphs from a network science perspective and find that: (1) When Graph Transformers are allowed to learn the optimal graph structure using all-to-all attention among input nodes, the Attention Graphs learned by the model do not tend to correlate with the input/original graph structure; and (2) For heterophilous graphs, different Graph Transformer variants can achieve similar performance while utilising distinct information flow patterns. Open source code: https://github.com/batu-el/understanding-inductive-biases-of-gnns
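The sparsity and parametrization axes of the Graph Transformer design space can be illustrated with a small NumPy sketch. This is a minimal illustration, not the authors' implementation. The function name `attention_matrix`, the single-head dot-product scoring, the added self-loops, and the choice of uniform weights for the "constant" case are all assumptions made here. The Dense Learned with Bias (DLB) variant is left out because this summary does not specify the form of its bias.

```python
import numpy as np

def softmax_rows(scores):
    # Numerically stable row-wise softmax; entries set to -inf receive weight 0.
    shifted = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(shifted)
    return weights / weights.sum(axis=1, keepdims=True)

def attention_matrix(X, Wq, Wk, adj, sparse, learned):
    """Hypothetical helper: one attention matrix for a chosen design-space corner.

    sparse=True  restricts attention to input-graph neighbors (plus self-loops);
    sparse=False allows all-to-all attention.
    learned=True uses scaled dot-product scores;
    learned=False uses uniform weights over allowed nodes (an assumption).
    """
    n = X.shape[0]
    # allowed[i, j] is True when node i may attend to node j.
    allowed = (adj + np.eye(n)) > 0 if sparse else np.ones((n, n), dtype=bool)
    if learned:
        scores = (X @ Wq) @ (X @ Wk).T / np.sqrt(Wq.shape[1])
    else:
        scores = np.zeros((n, n))
    return softmax_rows(np.where(allowed, scores, -np.inf))

# Toy input: a 4-node path graph with random features and projections.
rng = np.random.default_rng(0)
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
X = rng.normal(size=(4, 8))
Wq = rng.normal(size=(8, 4))
Wk = rng.normal(size=(8, 4))

sparse_learned = attention_matrix(X, Wq, Wk, adj, sparse=True, learned=True)
dense_learned = attention_matrix(X, Wq, Wk, adj, sparse=False, learned=True)
sparse_constant = attention_matrix(X, Wq, Wk, adj, sparse=True, learned=False)
```

In this sketch every row sums to 1, the sparse variants put zero weight on non-neighbors, and the dense variant is free to place weight on any pair of nodes, including pairs with no edge in the input graph.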

Paper Structure

This paper contains 18 sections, 11 equations, 9 figures, 4 tables.

Figures (9)

  • Figure 1: Attention Graphs for mechanistic interpretability of GNNs and Graph Transformers. Left: Graph Neural Networks are equivalent to Transformers operating on fully connected graphs (Joshi, 2020). Middle: The attention matrices at each layer and each head in the Transformer tell us how information flows among input tokens. Right: The attention matrices can be aggregated across layers and heads to construct a directed Attention Graph of information flow in the GNN/Graph Transformer. We can study Attention Graphs from a network science perspective to mechanistically understand the algorithms learned by GNNs and Graph Transformers.
  • Figure 2: Design space of Graph Transformers based on two key dimensions: (1) sparsity of attention (sparse vs. dense) and (2) parametrization of attention (constant vs. learned).
  • Figure 3: Aggregating attention across layers by matrix multiplication. Attention matrices from successive layers are combined to capture indirect information flow. For node $i$, row $i$ in the attention matrix $\mathbb{A}_{L_2}$ represents how much it attends to each intermediate node $j$. Each row $j$ in $\mathbb{A}_{L_1}$ captures how those intermediate nodes attend to other nodes $k$. Matrix multiplication $\mathbb{A}_{L_2}\mathbb{A}_{L_1}$ combines these patterns, revealing how node $i$ indirectly attends to node $k$ through intermediate nodes $j$.
  • Figure 4: Distribution of attention between neighbors and non-neighbors across different Graph Transformer architectures. For SL (Sparse Learned), DLB (Dense Learned with Bias), and DL (Dense Learned) models, we visualize the attention patterns for four configurations: 1-layer 1-head, 1-layer 2-head, 2-layer 1-head, and 2-layer 2-head. Each point represents the weight of attention paid by a node in the aggregated Attention Graph, and whether it attends to a neighbor or non-neighbor from the input graph. DLB models mostly attend to neighbors, while DL models distribute attention more uniformly between neighbors and non-neighbors.
  • Figure 5: Different graph inductive biases lead to distinct algorithmic strategies. We plot quasi-adjacency matrices derived from Attention Graphs for DLB and DL models across different datasets for the 2-layer 2-head configuration. Black squares indicate no edges in the thresholded Attention Graph, while white squares indicate edges. DLB models exhibit strong self-attention patterns (diagonal lines), suggesting they focus on initial node features rather than aggregating information from neighbors. DL models develop reference nodes (vertical lines) that receive high attention from all other nodes, suggesting a classification algorithm based on comparing nodes against these references. See Figure \ref{fig:HM} for all model configurations and datasets.
  • ...and 4 more figures
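
The layer-wise aggregation described in the Figure 3 caption, and the thresholded quasi-adjacency matrices mentioned for Figure 5, can be sketched in a few lines of NumPy. This is a rough sketch under stated assumptions rather than the paper's code. Averaging over heads, the `1 / n` threshold, the random toy attention, and the names `aggregate_attention` and `quasi_adjacency` are all hypothetical choices made for illustration.

```python
import numpy as np

def aggregate_attention(attn):
    """Combine attention of shape (layers, heads, n, n) into one (n, n) matrix.

    Layer 0 is the first layer. Row i of each matrix holds how much node i
    attends to every other node, so rows sum to 1.
    """
    per_layer = attn.mean(axis=1)  # ASSUMPTION: heads are combined by averaging
    total = per_layer[0]
    for layer in per_layer[1:]:
        # Later layers multiply on the left, e.g. A_{L_2} @ A_{L_1}.
        total = layer @ total
    return total

def quasi_adjacency(aggregated, threshold):
    # Binary matrix: 1 where aggregated attention exceeds the threshold.
    return (aggregated > threshold).astype(int)

# Toy example: 2 layers, 2 heads, 5 nodes, random row-stochastic attention.
rng = np.random.default_rng(0)
layers, heads, n = 2, 2, 5
attn = rng.dirichlet(np.ones(n), size=(layers, heads, n))

aggregated = aggregate_attention(attn)
quasi = quasi_adjacency(aggregated, threshold=1.0 / n)  # hypothetical threshold

# Input graph for comparison: a 5-node ring (illustrative only).
adj = np.roll(np.eye(n), 1, axis=1) + np.roll(np.eye(n), -1, axis=1)

neighbor_mass = (aggregated * adj).sum(axis=1)  # attention paid to input-graph neighbors
self_attention = np.diag(aggregated)            # diagonal: attention a node pays to itself
in_degree = quasi.sum(axis=0)                   # column sums: how many nodes attend to each node
```

Because each per-layer matrix is row-stochastic, the product is row-stochastic as well, so entry `(i, k)` of `aggregated` can be read as the share of node `i`'s attention that reaches node `k` through intermediate nodes. A large diagonal corresponds to the self-attention pattern described for DLB models, and a column with high `in_degree` corresponds to a reference node of the kind described for DL models.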