High-performance training and inference for deep equivariant interatomic potentials

Chuin Wei Tan, Marc L. Descoteaux, Mit Kotak, Gabriel de Miranda Nascimento, Seán R. Kavanagh, Laura Zichi, Menghang Wang, Aadit Saluja, Yizhong R. Hu, Tess Smidt, Anders Johansson, William C. Witt, Boris Kozinsky, Albert Musaelian

TL;DR

The paper tackles the scalability and performance bottlenecks of deep equivariant interatomic potentials by overhauling the NequIP framework for multi-node training and fast inference. It combines PyTorch 2.0 TorchInductor for end-to-end train-time compilation, a custom distributed data-parallel scheme, and Ahead-of-Time Inductor (AOTI) for efficient deployment in HPC codes, augmented by a fused Triton tensor-product kernel. In a SPICE 2 case study training Allegro models, the approach yields 2.4–5× training speedups and 4–18× inference speedups, enabling large-scale MD simulations with improved memory efficiency. The work delivers an extensible, HPC-ready platform for MLIPs that can accelerate materials discovery and biomolecular simulations through scalable, hardware-aware training and deployment.

Abstract

Machine learning interatomic potentials, particularly those based on deep equivariant neural networks, have demonstrated state-of-the-art accuracy and computational efficiency in atomistic modeling tasks like molecular dynamics and high-throughput screening. The size of datasets and demands of downstream workflows are growing rapidly, making robust and scalable software essential. This work presents a major overhaul of the NequIP framework focusing on multi-node parallelism, computational performance, and extensibility. The redesigned framework supports distributed training on large datasets and removes barriers preventing full utilization of the PyTorch 2.0 compiler at train time. We demonstrate this acceleration in a case study by training Allegro models on the SPICE 2 dataset of organic molecular systems. For inference, we introduce the first end-to-end infrastructure that uses the PyTorch Ahead-of-Time Inductor compiler for machine learning interatomic potentials. Additionally, we implement a custom kernel for the Allegro model's most expensive operation, the tensor product. Together, these advancements speed up molecular dynamics calculations on system sizes of practical relevance by up to a factor of 18.

Paper Structure

This paper contains 15 sections, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Allegro model results for the SPICE 2 test set [eastman2024nutmeg]. Energy and force mean absolute error (MAE) of the three Allegro models on three different subsets of the SPICE 2 test set: (1) the original unrestricted test set, (2) a subset containing systems with neutral total charge, and (3) a subset of systems that contain only atomic species with neutral per-atom formal charge. The error metrics are grouped by system type: large ligands, small ligands, peptides, and dimers.
  • Figure 2: Scalability of distributed machine-learning interatomic potential training. Average time per training epoch as a function of the number of MPI ranks, where the per-rank local batch size is eight atomic configurations. Times are measured for training with TorchScript or torch.compile. Plots are shown for both NVIDIA A100 (80GB) and AMD MI250X GPUs. Note that one MPI rank corresponds to one of the two available graphics compute dies on a single MI250X device.
  • Figure 3: Single-rank inference acceleration on small molecule systems. The inference speed in LAMMPS of the small, medium, and large Allegro models deployed using TorchScript, AOTI, or AOTI with the optimized tensor product kernel (AOTI + custom TP) on small molecule systems ranging from 25 to 100 atoms without periodic boundary conditions [eastman2024nutmeg], for AMD MI250X, NVIDIA A100 (80GB), and NVIDIA H100 GPUs. The inference speeds were averaged over three runs with different random seeds for the initial velocities generated by LAMMPS. One MPI rank was used. Note that one MPI rank corresponds to one of the two available graphics compute dies on an MI250X device.
  • Figure 4: Single-rank inference acceleration for periodic water boxes. Inference speed in LAMMPS for the small, medium, and large Allegro models deployed using TorchScript, AOTI, and AOTI with the optimized tensor product kernel (AOTI + custom TP) for liquid water boxes ranging from 24 to 5184 atoms with periodic boundary conditions. Annotations show the speedup of AOTI + custom TP compared to TorchScript for the largest system that both approaches can run. The speeds are measured on the AMD MI250X, NVIDIA A100 (80GB), and NVIDIA H100 GPUs. One MPI rank was used for each simulation (for the MI250X device, one MPI rank corresponds to one of the two available graphics compute dies).
  • Figure 5: Strong scaling of the medium Allegro model on biomolecular systems. Molecular dynamics throughput of the Allegro model deployed using TorchScript, AOTI, or AOTI with the optimized tensor product kernel (AOTI + custom TP) on the 23,558-atom dihydrofolate reductase (DHFR) and 408,609-atom cellulose systems from the Amber20 benchmark [amberbench], measured on 1 to 512 nodes of Frontier (AMD MI250X) and Perlmutter (NVIDIA A100 (40GB)). Note that an MI250X node has twice as many logical GPU devices and corresponding MPI ranks (8) as an A100 node (4). No TorchScript result is shown for cellulose on A100 GPUs because TorchScript required more GPU memory than was available on these nodes.
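
The rank and batch bookkeeping behind Figures 2 and 5 can be sketched in a few lines of Python. This is an illustrative sketch, not code from the NequIP framework: the helper names are hypothetical, the per-node rank counts (8 for MI250X, 4 for A100) and the local batch size of eight come from the captions, and the global batch size assumes the usual data-parallel convention that the effective batch is the sum of the per-rank batches.

```python
# Ranks per node as stated in the Figure 5 caption (one MPI rank per logical GPU).
RANKS_PER_NODE = {"MI250X": 8, "A100": 4}

# Per-rank local batch size stated in the Figure 2 caption.
LOCAL_BATCH_SIZE = 8


def total_ranks(nodes: int, gpu: str) -> int:
    """Number of MPI ranks when every logical GPU on every node gets one rank."""
    return nodes * RANKS_PER_NODE[gpu]


def global_batch_size(ranks: int, local_batch_size: int = LOCAL_BATCH_SIZE) -> int:
    """Configurations per optimizer step, assuming per-rank batches are summed."""
    return ranks * local_batch_size


def strong_scaling_efficiency(
    base_nodes: int, base_throughput: float, nodes: int, throughput: float
) -> float:
    """Measured speedup divided by ideal speedup for a fixed-size system."""
    measured_speedup = throughput / base_throughput
    ideal_speedup = nodes / base_nodes
    return measured_speedup / ideal_speedup


if __name__ == "__main__":
    # Largest run in Figure 5: 512 nodes on each machine.
    print(total_ranks(512, "MI250X"), total_ranks(512, "A100"))
    # Hypothetical throughputs (not values from the paper), for illustration only.
    print(strong_scaling_efficiency(1, 10.0, 4, 30.0))
```

An efficiency of 1.0 would mean throughput grows in direct proportion to node count. For a fixed-size system such as DHFR, the value is expected to fall as the number of atoms per rank shrinks.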