When Does Global Attention Help? A Unified Empirical Study on Atomistic Graph Learning

Arindam Chowdhury, Massimiliano Lupo Pasini

TL;DR

The paper tackles the question of when global attention helps atomistic graph learning and provides a unified, reproducible benchmark to dissect the roles of local message passing, encoders, and global attention. By evaluating four configurations within a HydraGNN-based framework across seven diverse datasets, it shows that encoder-based augmentations robustly improve local-property predictions while fused local–global models yield the clearest benefits for long-range interaction regimes, all under explicit compute-cost considerations. The study delivers practical guidelines: use encoder-augmented MPNNs by default, add moderate global attention for nonlocal tasks, and prefer modest attention budgets to maintain parameter efficiency. This work establishes a principled, replicable benchmark, enabling fair comparisons and informing future method development in atomistic graph learning.

Abstract

Graph neural networks (GNNs) are widely used as surrogates for costly experiments and first-principles simulations to study the behavior of compounds at the atomistic scale, and their architectural complexity is steadily increasing to enable the modeling of complex physics. Most recent GNNs combine traditional message-passing neural network (MPNN) layers, which model short-range interactions, with graph transformers (GTs), whose global attention mechanisms model long-range interactions. However, it remains unclear when global attention provides real benefits over well-tuned MPNN layers, because implementations, features, and hyperparameter tuning are inconsistent across studies. We introduce the first unified, reproducible benchmarking framework, built on HydraGNN, that enables seamless switching among four controlled model classes: MPNN, MPNN with chemistry/topology encoders, GPS-style hybrids of MPNN with global attention, and fully fused local-global models with encoders. Using seven diverse open-source datasets spanning regression and classification tasks, we systematically isolate the contributions of message passing, global attention, and encoder-based feature augmentation. Our study shows that encoder-augmented MPNNs form a robust baseline, while fused local-global models yield the clearest benefits for properties governed by long-range interaction effects. We further quantify the accuracy-compute trade-offs of attention, reporting its overhead in memory. Together, these results establish the first controlled evaluation of global attention in atomistic graph learning and provide a reproducible testbed for future model development.
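
The four controlled model classes named in the abstract can be read as the combinations of two on/off choices: whether encoders are used and whether global attention is used. The sketch below enumerates them under that reading. The class name `ModelConfig` and the flags `use_encoders` and `use_global_attention` are hypothetical names chosen for illustration; they are not taken from HydraGNN or from the paper's code.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class ModelConfig:
    """Hypothetical description of one benchmark configuration."""
    use_encoders: bool          # assumption: chemistry/topology encoders on or off
    use_global_attention: bool  # assumption: global attention module on or off

    @property
    def label(self) -> str:
        # Map the two flags to the model-class names used in the abstract.
        if self.use_encoders and self.use_global_attention:
            return "fused local-global with encoders"
        if self.use_global_attention:
            return "GPS-style MPNN + global attention"
        if self.use_encoders:
            return "MPNN + chemistry/topology encoders"
        return "MPNN"


def all_configs():
    """Return the four configurations in a fixed order."""
    return [ModelConfig(e, a) for e, a in product([False, True], repeat=2)]


if __name__ == "__main__":
    for cfg in all_configs():
        print(cfg.label)
```

Treating the classes as a two-flag grid keeps every other setting fixed between runs, which is the kind of controlled comparison the abstract describes.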

Paper Structure

This paper contains 33 sections, 18 equations, 8 figures, 18 tables.

Figures (8)

  • Figure 1: Geometric atomistic graph with cutoff $r_c$.
  • Figure 2: A generic MPNN layer $k$ is detailed. $\mathbf{h}_u = [\mathbf{H}]_u$
  • Figure 3: Flow diagram of the computational steps in our proposed framework for a $K$-layer model. Each block operates in parallel on individual graphs. Four independent pipelines are selected by the configurations of the switches S1 and S2. For example, if S1 and S2 are both open, the HydraGNN pipeline is obtained; if both are closed, the global attention module is fused with the local message-passing module while encoders provide domain-specific and positional information to both modules. Input features are embedded into suitable subspaces before being fed to the learnable modules. The output of layer $K$ is used for downstream tasks through hyper-parameter optimization.
  • Figure 4: ZINC parity plots (Predicted vs. True logP) for the best HPO trial per scheme
  • Figure 5: QM9 parity plots (Predicted vs. True free energy) for the best HPO trial per scheme
  • ...and 3 more figures
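
As an illustration of the geometric atomistic graph with cutoff $r_c$ in Figure 1, the following minimal sketch connects every pair of atoms whose Euclidean distance is at most $r_c$. It assumes non-periodic Cartesian coordinates and uses a brute-force pairwise loop; the function name `radius_graph` and the example coordinates are hypothetical, and the paper's actual graph construction may differ (for example, in how it handles periodic boundaries or neighbor limits).

```python
import math


def radius_graph(positions, r_c):
    """Return directed edges (i, j) for all distinct atom pairs within r_c.

    positions: list of (x, y, z) tuples (assumed non-periodic).
    r_c: cutoff radius in the same length unit as the positions.
    """
    edges = []
    for i, p in enumerate(positions):
        for j, q in enumerate(positions):
            # Skip self-loops; keep both (i, j) and (j, i) for message passing.
            if i != j and math.dist(p, q) <= r_c:
                edges.append((i, j))
    return edges


if __name__ == "__main__":
    # Three atoms on a line at x = 0, 1, 3 (arbitrary units).
    atoms = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
    print(radius_graph(atoms, r_c=1.5))
    print(radius_graph(atoms, r_c=2.5))
```

With a small cutoff, distant atoms share no edge, so information between them must pass through intermediate atoms over several message-passing layers. This is the locality that global attention is meant to bypass.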