A recipe for scalable attention-based MLIPs: unlocking long-range accuracy with all-to-all node attention

Eric Qu; Brandon M. Wood; Aditi S. Krishnapriyan; Zachary W. Ulissi

TL;DR

This work proposes AllScAIP, a straightforward, attention-based, and energy-conserving MLIP model that scales to O(100 million) training samples and achieves state-of-the-art energy/force accuracy on molecular systems and on a number of physics-based evaluations.

Abstract

Machine-learning interatomic potentials (MLIPs) have advanced rapidly, with many top models relying on strong physics-based inductive biases. However, as models scale to larger systems like biomolecules and electrolytes, they struggle to accurately capture long-range (LR) interactions, leading current approaches to rely on explicit physics-based terms or components. In this work, we propose AllScAIP, a straightforward, attention-based, and energy-conserving MLIP model that scales to O(100 million) training samples. It addresses the long-range challenge using a data-driven all-to-all node attention component. Extensive ablations reveal that in low-data/small-model regimes, inductive biases improve sample efficiency. However, as data and model size scale, these benefits diminish or even reverse, while all-to-all attention remains critical for capturing LR interactions. On OMol25, our model achieves state-of-the-art energy/force accuracy on molecular systems and on a number of physics-based evaluations, while remaining competitive on materials (OMat24) and catalysts (OC20). Furthermore, it enables stable, long-timescale MD simulations that accurately recover experimental observables, including density and heat of vaporization.
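
As a rough illustration of what "all-to-all node attention" means, the sketch below computes single-head scaled dot-product self-attention in which every node attends to every other node, regardless of distance. It is a minimal sketch under stated assumptions, not the AllScAIP implementation: the function names are hypothetical, node features are used directly as queries, keys, and values (no learned projections), and no geometric encodings are applied.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def all_to_all_attention(features):
    """Single-head self-attention over ALL nodes (hypothetical helper).

    features: list of N node vectors (each a list of d floats).
    Assumption: features serve as Q, K, and V directly; a real model
    would apply learned projections and multiple heads.
    Returns (outputs, weights), where weights is an N x N matrix.
    """
    n = len(features)
    d = len(features[0])
    scale = 1.0 / math.sqrt(d)
    outputs, weights = [], []
    for q in features:
        # Score this node against every node in the system (no cutoff).
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in features]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of all node features.
        outputs.append([sum(w[j] * features[j][c] for j in range(n)) for c in range(d)])
    return outputs, weights


if __name__ == "__main__":
    feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    out, w = all_to_all_attention(feats)
    print(len(w), len(w[0]))  # N x N attention matrix
```

Because the attention matrix is N x N, each node can in principle receive information from arbitrarily distant nodes in a single layer, which is the property the abstract ties to long-range accuracy; the price is quadratic cost in system size.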

Paper Structure (46 sections, 2 equations, 12 figures, 9 tables)

Introduction
Related Works
Machine Learning Interatomic Potentials.
Long-range Interactions in MLIPs.
Methods
Attention Operations
Neighborhood Self-attention.
All-to-all Node Self-attention.
Geometric Encodings
Legendre Angular Encoding (LAE).
Euclidean Rotary Position Encoding (ERoPE).
AllScAIP Model
Inductive Biases
Ablations
Ablations on Model Components.
...and 31 more sections

Figures (12)

Figure 1: AllScAIP model design. The simple backbone design enables efficient scaling.
Figure 2: Illustration of the attention operations used in AllScAIP. (a) Neighbor attention. (b) Node attention.
Figure 3: Illustration of the geometric encodings used in AllScAIP. (a) Legendre Angular Encoding (LAE). (b) Euclidean Rotary Position Encoding (ERoPE).
Figure 4: Throughput and memory vs. system size. Left: Atom-time (number of atoms $\times$ ns/day, higher is better) vs. system size, measured on a single H200 (141 GB) GPU with graph generation off. We report four AllScAIP model sizes (35M/85M/180M/1B) and eSEN baselines. The dotted vertical lines indicate approximately where the $\mathcal{O}(N^2)$ cost of the node attention begins to dominate the $\mathcal{O}(Nk)$ cost of the neighborhood attention, where $N$ is the number of atoms and $k$ is the maximum number of neighbors. Right: VRAM usage vs. system size. Dotted horizontal lines indicate the VRAM capacity of common GPUs.
Figure 5: OMol25 energy error vs. inference throughput. Energy/atom MAE (meV, $\downarrow$) vs. throughput (ns/day, $\uparrow$) for 35M/85M models with and without encodings, compared with eSEN baselines [fu2025learning]. Conservative models are shown with hollow markers. Left: models trained on OMol25 4M (80 epochs). Right: models trained on OMol25 102M (12 epochs). Our models trace the Pareto front at 4M; at 102M the gap between models with and without encodings shrinks or flips, indicating that these inductive biases may be unnecessary at scale. The larger speed gap between the direct-force and conservative AllScAIP models, compared to eSEN, arises because the differentiable kNN graph construction used in AllScAIP is a newly introduced operation that has not yet been optimized [liu2026evaluation].
...and 7 more figures
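
The scaling argument in the Figure 4 caption can be made concrete with a small back-of-the-envelope calculation. The sketch below counts attended pairs only, assuming unit cost per pair and ignoring all constant factors, so the crossover it implies (N comparable to k) is illustrative and will not match the measured crossover in the figure. The function name and the value k = 30 are hypothetical choices for illustration, not values taken from the paper.

```python
def attention_pair_counts(num_atoms, max_neighbors):
    """Count attended pairs for the two attention types (hypothetical helper).

    Node attention scales as O(N^2); neighborhood attention as O(N k).
    Assumption: one unit of work per pair, constant factors ignored.
    """
    node_pairs = num_atoms * num_atoms
    neighbor_pairs = num_atoms * max_neighbors
    return node_pairs, neighbor_pairs


if __name__ == "__main__":
    k = 30  # hypothetical maximum number of neighbors
    for n in (10, 30, 100, 1000, 10000):
        node, nbr = attention_pair_counts(n, k)
        dominant = "node" if node > nbr else "neighbor (or tie)"
        print(f"N={n:>6}  node={node:>10}  neighbor={nbr:>8}  dominant: {dominant}")
```

Under these simplifying assumptions the ratio of node to neighborhood pairs is simply N / k, so the quadratic term grows linearly more expensive relative to the local term as the system gets larger.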
