Transformers are Graph Neural Networks

Chaitanya K. Joshi

TL;DR

The paper formalizes a deep connection between Transformer architectures and Graph Neural Networks (GNNs) by showing that self-attention on a fully connected token graph performs graph-like message passing. It draws explicit parallels between Transformers and Graph Attention Networks, and discusses how positional encodings can inject graph structure without hard constraints, motivating Graph Transformers. The main contributions include a precise equivalence between multi-head self-attention and GNN message passing, and an argument that Transformers’ hardware efficiency underpins their practical dominance for graph-structured data. The work highlights the significance of hardware-aware design in representation learning and suggests directions that blend local, graph-inspired inductive biases with global attention mechanisms.

Abstract

We establish connections between the Transformer architecture, originally introduced for natural language processing, and Graph Neural Networks (GNNs) for representation learning on graphs. We show how Transformers can be viewed as message passing GNNs operating on fully connected graphs of tokens, where the self-attention mechanism captures the relative importance of all tokens w.r.t. each other, and positional encodings provide hints about sequential ordering or structure. Thus, Transformers are expressive set processing networks that learn relationships among input elements without being constrained by a priori graphs. Despite this mathematical connection to GNNs, Transformers are implemented via dense matrix operations that are significantly more efficient on modern hardware than sparse message passing. This leads to the perspective that Transformers are GNNs currently winning the hardware lottery.

Paper Structure

This paper contains 10 sections, 12 equations, 4 figures.

Figures (4)

  • Figure 1: Representation Learning for NLP. RNNs build representations one token at a time, which captures the sequential nature of language. Transformers build representations in parallel via attention mechanisms, which capture the relative importance of words w.r.t. each other.
  • Figure 2: A simple attention mechanism. Taking as input the representation of the token $h_{i}^{\ell}$ and the set of other tokens in the sentence $\{ h_{j}^{\ell} \; \forall j \in \mathcal{S} \}$, we compute the attention weights $w_{ij}$, denoting the relative importance of each pair $(i,j)$, through a dot-product followed by a softmax normalization. Finally, we produce the updated token representation $h_{i}^{\ell+1}$ by summing over the representations of tokens $\{ h_{j}^{\ell} \}$ weighted by the corresponding $w_{ij}$. Every token undergoes the same pipeline in parallel to update its representation.
  • Figure 3: A Transformer layer. A multi-head attention sub-layer computes the relative importance of each token in a sentence w.r.t. each other token, and updates their representations accordingly. The updated representations are then processed by a token-wise multi-layer perceptron (MLP) sub-layer. Modern variants of the Transformer use SwiGLU (Shazeer, 2020) instead of ReLU (Glorot et al., 2011) as the MLP's activation function, and apply the LayerNorm operations before the multi-head attention and feed-forward sub-layers, rather than after (Xiong et al., 2020).
  • Figure 4: Representation learning on graphs with message passing. (left) Graphs model complex systems via a set of nodes connected by edges. (middle) GNNs build latent representations of graph data via message passing, where each node learns to aggregate representations from its local neighbourhood. (right) Stacking $L$ message passing layers enables GNNs to send and aggregate information from $L$-hop subgraphs around each node.
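
The simple attention update of Figure 2 can be sketched as one round of message passing, which may help make the correspondence with Figure 4 concrete. The following is a minimal illustration in plain Python, not the paper's implementation. The function name `attention_message_passing`, the `neighbours` argument, and the toy vectors are hypothetical, and the sketch assumes no learned query, key, or value projections, a single head, and that each token also attends to itself.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def attention_message_passing(h, neighbours=None):
    """One attention update written as message passing (hypothetical helper).

    h          : list of token/node vectors h_j at layer l
    neighbours : dict mapping node i to the list of nodes it aggregates from;
                 None means a fully connected graph (every token, incl. itself)
    Returns the list of updated vectors h_i at layer l+1.
    """
    n = len(h)
    updated = []
    for i in range(n):
        nbrs = list(range(n)) if neighbours is None else neighbours[i]
        # Attention weights w_ij: dot-product scores, softmax-normalised over j.
        weights = softmax([dot(h[i], h[j]) for j in nbrs])
        # Aggregate: weighted sum of the neighbours' representations.
        dim = len(h[i])
        updated.append(
            [sum(w * h[j][k] for w, j in zip(weights, nbrs)) for k in range(dim)]
        )
    return updated


# Toy input: three tokens with 2-dimensional representations (made-up values).
h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Transformer-style view: every token attends to every token.
dense = attention_message_passing(h)

# GNN-style view of the same computation: an explicit fully connected graph.
complete_graph = {i: [0, 1, 2] for i in range(3)}
as_graph = attention_message_passing(h, complete_graph)

# Restricting aggregation to a local neighbourhood (here a chain 0-1-2 with
# self-loops) gives a sparse, graph-constrained update instead.
chain = {0: [0, 1], 1: [0, 1, 2], 2: [1, 2]}
local = attention_message_passing(h, chain)
```

In this sketch `dense` and `as_graph` are identical, since a fully connected neighbourhood is all that distinguishes the Transformer-style update from the graph-style one, while `local` differs because each node only aggregates over its neighbours in the chain. Practical implementations batch the dense case into matrix multiplications rather than looping over nodes, which is the hardware-efficiency point made in the abstract.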