The Importance of Being Scalable: Improving the Speed and Accuracy of Neural Network Interatomic Potentials Across Chemical Domains

Eric Qu, Aditi S. Krishnapriyan

TL;DR

The findings indicate that scaling the model through attention mechanisms is efficient and improves model expressivity, and motivate us to develop an NNIP architecture designed for scalability: the Efficiently Scaled Attention Interatomic Potential (EScAIP).

Abstract

Scaling has been critical in improving model performance and generalization in machine learning. It involves how a model's performance changes with increases in model size or input data, as well as how efficiently computational resources are utilized to support this growth. Despite successes in other areas, the study of scaling in Neural Network Interatomic Potentials (NNIPs) remains limited. NNIPs act as surrogate models for ab initio quantum mechanical calculations. The dominant paradigm here is to incorporate many physical domain constraints into the model, such as rotational equivariance. We contend that these complex constraints inhibit the scaling ability of NNIPs, and are likely to lead to performance plateaus in the long run. In this work, we take an alternative approach and start by systematically studying NNIP scaling strategies. Our findings indicate that scaling the model through attention mechanisms is efficient and improves model expressivity. These insights motivate us to develop an NNIP architecture designed for scalability: the Efficiently Scaled Attention Interatomic Potential (EScAIP). EScAIP leverages a multi-head self-attention formulation within graph neural networks, applying attention to neighbor-level representations. Implemented with highly optimized attention GPU kernels, EScAIP achieves substantial gains in efficiency (at least 10x faster inference and 5x less memory usage) compared to existing NNIPs. EScAIP also achieves state-of-the-art performance on a wide range of datasets including catalysts (OC20 and OC22), molecules (SPICE), and materials (MPTrj). We emphasize that our approach should be thought of as a philosophy rather than a specific model, representing a proof-of-concept for developing general-purpose NNIPs that achieve better expressivity through scaling, and continue to scale efficiently with increased computational resources and training data.

Paper Structure

This paper contains 45 sections, 1 equation, 8 figures, 6 tables.

Figures (8)

  • Figure 1: Efficiency, performance, and scaling comparisons between EScAIP and baseline models on the Open Catalyst dataset (OC20). Force MAE (meV/Å, $\downarrow$) vs. inference speed (samples/sec, $\uparrow$) and force MAE vs. memory (GB/sample, $\downarrow$) are reported. Results with energy MAE can be found in the Appendix (Fig. \ref{fig:speed_energy}). EScAIP achieves better performance at lower time and memory cost.
  • Figure 2: Results of the ablation study of EquiformerV2 (Liao et al., 2023) on the OC20 2M dataset. Energy (eV) and force (eV/Å) mean absolute error (MAE) are reported, along with each model's parameter count. The leftmost column shows the original results from Liao et al. (2023), where models with different $L$ had different numbers of trainable parameters. We look at scaling parameters through the attention mechanisms (AT) and spherical channels (SC) for the original $L=2$ and $L=4$ models, such that the number of parameters is approximately equal to that of the original $L=6$ model. Scaling parameters in different ways affects the overall energy and force errors, and increasing attention parameters is particularly effective in improving model performance (More AT). We also modify the architecture to be invariant ($L=0$), allowing us to examine the effects of excluding rotational equivariance while controlling for the number of parameters (BOO). After controlling for parameter counts, many of the models have errors comparable to the original $L=6$ model.
  • Figure 3: Illustration of the Efficiently Scaled Attention Interatomic Potential (EScAIP) model architecture. The model consists of $B$ graph attention blocks (dashed box), each of which contains a graph attention layer, a feed forward layer, and two readout layers for node and edge features. The concatenated readouts from each block are used to predict per-atom forces and system energy.
  • Figure 4: Detailed illustration of the graph attention block. The input attributes are projected and concatenated into a large message tensor. The tensor is fed into an optimized multi-head self-attention computation, where the max number of neighbors dimension is the sequence length dimension.
  • Figure 5: Inference runtime and memory usage comparison of EScAIP and baseline models on the OC20 dataset. Mean and standard deviation are reported across 16 randomly sampled batches per batch size. Grey lines indicate the cumulative number of atoms in the batch. EScAIP not only scales efficiently with batch size, but also exhibits minimal variation in performance across different batches. All reported results were measured on an NVIDIA V100 GPU with 32 GB of memory.
  • ...and 3 more figures
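
The neighbor-level attention described in the Figure 4 caption can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name, the weight matrices, the tensor shapes, and the padding mask are all assumptions made for illustration, and the optimized GPU attention kernels mentioned in the abstract are replaced here by a plain masked softmax. The only idea taken from the caption is that each atom's padded neighbor list plays the role of the sequence dimension in multi-head self-attention.

```python
import numpy as np


def neighbor_self_attention(messages, neighbor_mask, w_q, w_k, w_v, w_o, num_heads):
    """Multi-head self-attention over each atom's (padded) neighbor list.

    messages:      (num_atoms, max_neighbors, dim) per-neighbor message tensor
    neighbor_mask: (num_atoms, max_neighbors) bool, True for real neighbors
    w_q, w_k, w_v, w_o: (dim, dim) projection matrices (hypothetical parameters)
    """
    num_atoms, max_neighbors, dim = messages.shape
    assert dim % num_heads == 0, "dim must be divisible by num_heads"
    head_dim = dim // num_heads

    def split_heads(x):
        # (atoms, neighbors, dim) -> (atoms, heads, neighbors, head_dim)
        x = x.reshape(num_atoms, max_neighbors, num_heads, head_dim)
        return x.transpose(0, 2, 1, 3)

    q = split_heads(messages @ w_q)
    k = split_heads(messages @ w_k)
    v = split_heads(messages @ w_v)

    # Attention scores between neighbors of the same atom:
    # the neighbor axis acts as the sequence length.
    scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(head_dim)

    # Padded neighbors must not be attended to (mask over the key axis).
    key_mask = neighbor_mask[:, None, None, :]
    scores = np.where(key_mask, scores, -1e9)

    # Numerically stable softmax over the key axis.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)

    # Merge heads back: (atoms, heads, neighbors, head_dim) -> (atoms, neighbors, dim)
    out = (weights @ v).transpose(0, 2, 1, 3).reshape(num_atoms, max_neighbors, dim)
    out = out @ w_o

    # Zero the rows that correspond to padding.
    return out * neighbor_mask[..., None]


# Toy usage: 3 atoms, up to 4 neighbors each, 8 features, 2 heads.
rng = np.random.default_rng(0)
dim, num_heads = 8, 2
messages = rng.standard_normal((3, 4, dim))
mask = np.array(
    [
        [True, True, True, False],
        [True, True, False, False],
        [True, True, True, True],
    ]
)
w_q, w_k, w_v, w_o = (rng.standard_normal((dim, dim)) / np.sqrt(dim) for _ in range(4))

out = neighbor_self_attention(messages, mask, w_q, w_k, w_v, w_o, num_heads)
print(out.shape)
```

Because all atoms share one padded `(num_atoms, max_neighbors, dim)` layout, the whole computation is a handful of dense batched matrix products, which is the kind of workload that fused attention kernels are built for. The mask ensures that padded slots neither contribute to nor receive any attention output.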