Transformers Discover Molecular Structure Without Graph Priors

Tobias Kreiman, Yutong Bai, Fadi Atieh, Elizabeth Weaver, Eric Qu, Aditi S. Krishnapriyan

TL;DR

This work questions the necessity of graph priors for molecular modeling by training an unmodified Transformer directly on Cartesian coordinates. On OMol25, a $1\mathrm{B}$-parameter Transformer achieves competitive energy and force MAEs compared to a state-of-the-art equivariant GNN while offering faster training and inference, underscoring the practical benefits of standard Transformer architectures. The study reveals that the Transformer learns physically meaningful patterns, such as inverse-distance attention and adaptive receptive fields, and it exhibits predictable scaling laws akin to other ML domains. Together, these results suggest that graph inductive biases can emerge from data with scalable architectures, motivating broader adoption of graph-free Transformers for molecular modeling and potential extensions to MD and uncertainty quantification.

Abstract

Graph Neural Networks (GNNs) are the dominant architecture for molecular machine learning, particularly for molecular property prediction and machine learning interatomic potentials (MLIPs). GNNs perform message passing on predefined graphs often induced by a fixed radius cutoff or k-nearest neighbor scheme. While this design aligns with the locality present in many molecular tasks, a hard-coded graph can limit expressivity due to the fixed receptive field and slows down inference with sparse graph operations. In this work, we investigate whether pure, unmodified Transformers trained directly on Cartesian coordinates, without predefined graphs or physical priors, can approximate molecular energies and forces. As a starting point for our analysis, we demonstrate how to train a Transformer to competitive energy and force mean absolute errors under a matched training compute budget, relative to a state-of-the-art equivariant GNN on the OMol25 dataset. We discover that the Transformer learns physically consistent patterns, such as attention weights that decay inversely with interatomic distance, and flexibly adapts them across different molecular environments due to the absence of hard-coded biases. The use of a standard Transformer also unlocks predictable improvements with respect to scaling training resources, consistent with empirical scaling laws observed in other domains. Our results demonstrate that many favorable properties of GNNs can emerge adaptively in Transformers, challenging the necessity of hard-coded graph inductive biases and pointing toward standardized, scalable architectures for molecular modeling.

Paper Structure

This paper contains 35 sections, 3 equations, 15 figures, 3 tables.

Figures (15)

  • Figure 1: Graph-Free Transformers Model Design. Our model encodes both discretized and continuous molecular sequences using unmodified Transformer layers. Placeholder values represent discrete inputs in the continuous sequence.
  • Figure 2: Discretization Scheme during Pre-training. We transform the standard ".xyz" molecular representation into a discretized input sequence for our model. To discretize continuous values, we use quantile binning, ensuring each bin contains the same number of datapoints. Atomic positions are jointly discretized into a 3D grid, while force and energy components are discretized independently along each dimension. We add special tokens, like beginning and end of sequence tokens. Note that these discretized tokens are accompanied by the continuous values (for positions, energies, forces, etc.), allowing the model to circumvent discretization errors for real-valued inputs.
  • Figure 3: Transformers scale predictably with training resources when modeling molecules. (a) We pre-train models of varying sizes up to 1B parameters with other training hyperparameters held fixed. Evaluation performance improves in a clear power-law relationship with model size. (b) We train models of varying sizes (5M, 30M, 90M, 350M) for differing numbers of epochs (1, 2, 4, 6) using our fine-tuning setup. We fit scaling laws with the three smaller models and make predictions about the performance of other model sizes trained for varying numbers of epochs. We plot predicted IsoFLOP curves, where smaller models on each curve are trained for more epochs and larger ones are trained for fewer epochs. Predicted IsoFLOP curves have a parabolic shape with the optimal model size and performance for each FLOP budget following a consistent power-law relationship, in line with previous work (Hoffmann et al., 2022). The IsoFLOP curves accurately extrapolate to predict the performance of the larger 350M parameter model.
  • Figure 4: Transformers effectively capture local features in early layers and global features in later layers. (a) We show what fraction of attention from position tokens goes towards other position tokens versus global tokens, such as charge and spin. In the first nine layers, position tokens predominantly focus on other position tokens. In the later layers, attention shifts towards global tokens. (b) We plot attention scores for position tokens against interatomic distance, averaged across validation examples in OMol25. Each dot is the mean attention score within an interatomic distance quantile. Attention in the first nine layers is strongly inversely related to distance. In later layers, position tokens increase their attention to global tokens (e.g., spin and charge) while still allocating some attention to distant atoms. These results suggest that Transformers capture local features in the early layers and then aggregate global information in the final layers.
  • Figure 5: Relationship between attention effective radius and atom density. Averaging over atoms in the OMol25 validation set, we plot the effective attention radius versus the median distance to other atoms. Each dot is the mean effective radius within a median neighbor distance percentile. We define the effective radius as the minimum distance within which 90% of an atom's attention mass is concentrated (see the effective-radius equation in the paper). The model learns to adaptively increase its effective attention radius when an atom is more isolated, and to decrease it when atoms are tightly packed.
  • ...and 10 more figures
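
Two quantities described in the captions above are easy to illustrate in code: the quantile binning of Figure 2 and the effective attention radius of Figure 5. The following is a minimal NumPy sketch, not the authors' implementation. The function names (`quantile_bin_edges`, `discretize`, `effective_radius`) are hypothetical. The binning covers only the one-dimensional case (as described for force and energy components), not the joint 3D grid used for positions. The radius function assumes one atom's attention weights over the other atoms are already available as an array.

```python
import numpy as np


def quantile_bin_edges(values, n_bins):
    """Interior bin edges chosen so each bin holds roughly the same number of points."""
    # Quantile levels strictly between 0 and 1, e.g. 0.25, 0.5, 0.75 for 4 bins.
    levels = np.linspace(0.0, 1.0, n_bins + 1)[1:-1]
    return np.quantile(np.asarray(values, dtype=float), levels)


def discretize(values, edges):
    """Map each continuous value to an integer bin id in [0, len(edges)]."""
    return np.searchsorted(edges, np.asarray(values, dtype=float), side="right")


def effective_radius(distances, attention, mass=0.9):
    """Smallest distance within which `mass` of the attention weight is concentrated.

    `distances` and `attention` are 1D arrays over the other atoms, seen from one
    query atom (an assumed input layout, not taken from the paper).
    """
    distances = np.asarray(distances, dtype=float)
    attention = np.asarray(attention, dtype=float)
    order = np.argsort(distances)                 # nearest atoms first
    sorted_d = distances[order]
    cumulative = np.cumsum(attention[order]) / attention.sum()
    # First index where the cumulative share reaches the threshold; the small
    # tolerance guards against floating-point round-off at exact ties.
    idx = np.searchsorted(cumulative, mass - 1e-12, side="left")
    idx = min(idx, len(sorted_d) - 1)
    return sorted_d[idx]


if __name__ == "__main__":
    forces = np.arange(100.0)                     # toy stand-in for one force component
    edges = quantile_bin_edges(forces, n_bins=4)
    tokens = discretize(forces, edges)
    print(np.bincount(tokens))                    # equal counts per bin

    d = [1.0, 2.0, 3.0, 4.0]                      # toy distances to four neighbors
    a = [0.5, 0.3, 0.15, 0.05]                    # toy attention weights
    print(effective_radius(d, a))
```

In this toy setup, an atom whose attention is spread over distant neighbors gets a larger effective radius than one whose attention sits on its nearest neighbors, which is the quantity Figure 5 plots against the median neighbor distance.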