Towards Mechanistic Interpretability of Graph Transformers via Attention Graphs
Batu El, Deepro Choudhury, Pietro Liò, Chaitanya K. Joshi
TL;DR
The paper addresses mechanistic interpretability for GNNs and Graph Transformers by introducing Attention Graphs, a framework that aggregates self-attention matrices across heads and layers to map how information flows among input nodes. It formalizes a design space for Graph Transformers along two axes, sparsity and parametrization, and proposes aggregation rules that produce a single Attention Graph capturing multi-hop information flow. Empirically, Graph Transformers with unconstrained (all-to-all) attention can learn information-flow patterns that diverge from the input graph, and on heterophilous graphs different architectures can reach similar performance through distinct information-flow strategies. The work lays a foundation for network-science-driven interpretability of graph-based models, offering insight into algorithmic diversity and a starting point for analyses of larger architectures and broader scientific tasks.
Abstract
We introduce Attention Graphs, a new tool for mechanistic interpretability of Graph Neural Networks (GNNs) and Graph Transformers based on the mathematical equivalence between message passing in GNNs and the self-attention mechanism in Transformers. Attention Graphs aggregate attention matrices across Transformer layers and heads to describe how information flows among input nodes. Through experiments on homophilous and heterophilous node classification tasks, we analyze Attention Graphs from a network science perspective and find that: (1) When Graph Transformers are allowed to learn the optimal graph structure using all-to-all attention among input nodes, the Attention Graphs learned by the model do not tend to correlate with the input/original graph structure; and (2) For heterophilous graphs, different Graph Transformer variants can achieve similar performance while utilising distinct information flow patterns. Open source code: https://github.com/batu-el/understanding-inductive-biases-of-gnns
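The aggregation described in the abstract can be illustrated with a minimal NumPy sketch. It rests on assumptions not stated in this section: heads are combined by averaging, layers are combined by matrix product (so that entry (i, j) reflects multi-hop flow from node j to node i), and the result is thresholded to obtain a graph. The function names `aggregate_attention` and `to_attention_graph` are hypothetical, and the rules actually used in the paper may differ from this sketch.

```python
import numpy as np


def aggregate_attention(attn):
    """Collapse per-layer, per-head attention into one (n, n) matrix.

    attn has shape (num_layers, num_heads, n, n); attn[l, h, i, j] is how
    much node i attends to node j in head h of layer l (rows sum to 1).
    ASSUMPTION: heads are averaged and layers are multiplied; this is an
    illustrative choice, not necessarily the paper's aggregation rule.
    """
    attn = np.asarray(attn, dtype=float)
    if attn.ndim != 4 or attn.shape[-1] != attn.shape[-2]:
        raise ValueError("expected shape (num_layers, num_heads, n, n)")

    per_layer = attn.mean(axis=1)  # average over heads -> (num_layers, n, n)
    flow = np.eye(attn.shape[-1])  # zero layers: each node only sees itself
    for layer_attn in per_layer:   # first layer first
        # Left-multiplying composes layers: flow = A_L @ ... @ A_1
        flow = layer_attn @ flow
    return flow


def to_attention_graph(flow, threshold):
    """Binary adjacency: edge j -> i is kept when flow[i, j] > threshold.

    ASSUMPTION: a fixed threshold is used purely for illustration.
    """
    return (np.asarray(flow) > threshold).astype(int)


# Usage on random attention for 2 layers, 4 heads, 5 nodes.
rng = np.random.default_rng(0)
logits = rng.normal(size=(2, 4, 5, 5))
attn = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)  # softmax
flow = aggregate_attention(attn)
graph = to_attention_graph(flow, threshold=1.0 / 5)  # above-uniform flow
```

Because averaging and multiplying row-stochastic matrices both preserve row sums, each row of `flow` still sums to 1 and can be read as a distribution over source nodes. The resulting adjacency matrix can then be compared with the input graph, for example by edge overlap.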
