GraphiT: Encoding Graph Structure in Transformers

Grégoire Mialon, Dexiong Chen, Margot Selosse, Julien Mairal

TL;DR

GraphiT shows that encoding graph structure within Transformer attention via kernel-based relative positioning and substructure features can outperform traditional GNNs on diverse graph tasks. By integrating diffusion and random-walk kernels for position-aware attention and augmenting node features with GCKN-derived substructures, GraphiT achieves strong accuracy on benchmarks like MUTAG, PROTEINS, PTC, NCI1, and ZINC. The model also provides interpretable attention maps that reveal discriminative graph motifs, supporting scientific applications where motif discovery is valuable. The results suggest that combining global attention with principled structure encodings offers a promising direction for scalable, interpretable graph representation learning and may benefit from further large-scale pretraining.
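To make the kernel-based relative positioning concrete, here is a minimal NumPy sketch, not the authors' implementation. It builds a diffusion kernel and a p-step random-walk kernel from a normalized graph Laplacian and uses one of them to reweight exponentiated attention scores before row normalization. The function names, the hyperparameters `beta`, `gamma`, and `p`, and the choice to share one projection between queries and keys are assumptions made for illustration.

```python
import numpy as np

def normalized_laplacian(adj):
    """Symmetric normalized Laplacian L = I - D^{-1/2} A D^{-1/2}."""
    adj = np.asarray(adj, dtype=float)
    deg = adj.sum(axis=1)
    inv_sqrt = np.zeros_like(deg)
    nonzero = deg > 0
    inv_sqrt[nonzero] = deg[nonzero] ** -0.5
    return np.eye(len(adj)) - inv_sqrt[:, None] * adj * inv_sqrt[None, :]

def diffusion_kernel(lap, beta=1.0):
    """exp(-beta * L), via the eigendecomposition of the symmetric L."""
    eigvals, eigvecs = np.linalg.eigh(lap)
    return (eigvecs * np.exp(-beta * eigvals)) @ eigvecs.T

def random_walk_kernel(lap, gamma=0.5, p=3):
    """(I - gamma * L)^p, a p-step random-walk kernel."""
    return np.linalg.matrix_power(np.eye(len(lap)) - gamma * lap, p)

def kernel_attention(x, w_q, w_v, kernel):
    """Attention whose exponentiated scores are multiplied by a graph kernel.

    Assumption of this sketch: queries and keys share the projection w_q,
    so the score matrix is symmetric.
    """
    q = x @ w_q
    scores = q @ q.T / np.sqrt(q.shape[1])
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores) * kernel                    # structure-aware reweighting
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ (x @ w_v), weights

# Toy example: a path graph 0 - 1 - 2 - 3 with random node features.
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
lap = normalized_laplacian(adj)
k_diff = diffusion_kernel(lap, beta=1.0)
k_rw = random_walk_kernel(lap, gamma=0.5, p=3)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
w_q = rng.normal(size=(8, 8))
w_v = rng.normal(size=(8, 8))
out, attn = kernel_attention(x, w_q, w_v, k_diff)
```

Passing an all-ones matrix as `kernel` recovers ordinary softmax attention, which makes explicit that the kernel is the only place where graph structure enters this layer.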

Abstract

We show that viewing graphs as sets of node features and incorporating structural and positional information into a transformer architecture can outperform representations learned with classical graph neural networks (GNNs). Our model, GraphiT, encodes such information by (i) leveraging relative positional encoding strategies in self-attention scores based on positive definite kernels on graphs, and (ii) enumerating and encoding local sub-structures such as paths of short length. We thoroughly evaluate these two ideas on many classification and regression tasks, demonstrating the effectiveness of each of them independently, as well as their combination. In addition to performing well on standard benchmarks, our model also admits natural visualization mechanisms for interpreting graph motifs explaining the predictions, making it a potentially strong candidate for scientific applications where interpretation is important. Code is available at https://github.com/inria-thoth/GraphiT.
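Idea (ii), enumerating short paths and encoding them, can be illustrated with a small sketch. The code below lists all simple paths with a fixed number of edges starting at each node, then averages the concatenated node features along those paths. The averaging step is a deliberately crude stand-in chosen for illustration; it is not the GCKN encoding used by GraphiT, and the helper names `enumerate_paths` and `path_features` are hypothetical.

```python
import numpy as np

def enumerate_paths(adj_list, start, length):
    """Return all simple paths with `length` edges that start at `start`."""
    paths = []

    def extend(path):
        if len(path) == length + 1:
            paths.append(tuple(path))
            return
        for nxt in adj_list[path[-1]]:
            if nxt not in path:  # simple paths only: no repeated nodes
                path.append(nxt)
                extend(path)
                path.pop()

    extend([start])
    return paths

def path_features(x, adj_list, length):
    """Per-node feature: mean of concatenated node features over its paths.

    Illustrative stand-in for a learned substructure encoder (not GCKN).
    Nodes with no path of the requested length get a zero vector.
    """
    n, d = x.shape
    feats = np.zeros((n, (length + 1) * d))
    for node in range(n):
        paths = enumerate_paths(adj_list, node, length)
        if paths:
            feats[node] = np.mean([x[list(p)].ravel() for p in paths], axis=0)
    return feats

# Toy graph: a triangle 0-1-2 with a pendant node 3 attached to node 2.
adj_list = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
x = np.eye(4)  # one-hot node features
paths_from_0 = enumerate_paths(adj_list, 0, 2)
feats = path_features(x, adj_list, 2)
```

In this toy graph, node 0 has three simple paths with two edges: (0, 1, 2), (0, 2, 1), and (0, 2, 3). The number of paths grows quickly with path length and node degree, which is one reason to restrict enumeration to short paths.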

Paper Structure

This paper contains 55 sections, 9 equations, 8 figures, 7 tables.

Figures (8)

  • Figure 1: Examples of molecules from Mutagenicity correctly classified as mutagenic by our model.
  • Figure 2: Attention scores averaged by heads for each layer of our trained model for the compounds in Figures \ref{fig:NO2} (Top) and \ref{fig:NH2} (Bottom). Top Left: diffusion kernel for \ref{fig:NO2}. Top Right: node $8$ (N of NO$_2$) is salient. Bottom Left: diffusion kernel for \ref{fig:NH2}. Bottom Right: node $14$ (N of NH$_2$) is salient.
  • Figure 3: 1,2-Dibromo-3-Chloropropane.
  • Figure 4: Attention scores averaged by heads for each layer of our trained model for the compound in Figure \ref{fig:dbcp}. Left: diffusion kernel for \ref{fig:dbcp}. Right: nodes $3$ and $5$ (Br) are salient.
  • Figure 5: Nitrobenzene-nitroimidazothiazole.
  • ...and 3 more figures