Scaling Neural-Network-Based Molecular Dynamics with Long-Range Electrostatic Interactions to 51 Nanoseconds per Day

Jianxiong Li, Beining Zhang, Mingzhen Li, Siyu Hu, Jinzhe Zeng, Lijun Liu, Guojun Yuan, Zhan Wang, Guangming Tan, Weile Jia

TL;DR

The paper addresses the bottlenecks in scaling neural-network-based molecular dynamics with long-range electrostatics by optimizing the DPLR framework on Fugaku. It introduces a hardware-offloaded 3D FFT (uTofu-FFT), an intra-node overlap strategy that nearly hides PPPM computations, and a ring-based atom-level load balancing scheme, along with node-level task division and a framework-free model-inference pipeline. These contributions yield a 37× speedup over the baseline and enable 51 ns/day for a 564-atom system and 32 ns/day for 403k atoms at large scales, while preserving ab initio accuracy. The results demonstrate strong performance gains from architecture-specific features and offer transferable insights for other NNMD and spatial decomposition workloads.

Abstract

Neural network-based molecular dynamics (NNMD) simulations incorporating long-range electrostatic interactions have significantly extended the applicability of NNMD to heterogeneous and ionic systems, enabling effective modeling of critical physical phenomena such as protein folding and dipolar surfaces while maintaining ab initio accuracy. However, neural network inference and long-range force computation remain the major bottlenecks, severely limiting simulation speed. In this paper, we target DPLR, a state-of-the-art NNMD package that supports long-range electrostatics, and propose a comprehensive set of optimizations to enhance computational efficiency. We introduce (1) a hardware-offloaded FFT method to reduce the communication overhead; (2) an overlapping strategy that hides long-range force computations using a single core per node; and (3) a ring-based load balancing method that enables even, atom-level task redistribution with minimal communication overhead. Experimental results on the Fugaku supercomputer show that our work achieves a 37x performance improvement, reaching a maximum simulation speed of 51 ns/day.

Paper Structure

This paper contains 17 sections, 8 equations, 10 figures, 1 table, 1 algorithm.

Figures (10)

  • Figure 1: A schematic overview of DPLR. (a) The environment of each local atom $i$ is constructed and represented as a neighbor list within the cutoff radius $r_c$. (b) The electronic energy $E_{Gt}$ and the corresponding electrostatic forces $\boldsymbol F_{ele}$ on local atoms (${\partial E_{Gt}}/{\partial \boldsymbol R_i}$) and WCs ($\partial E_{Gt} /{\partial \boldsymbol W_{n(i)}}$) are computed using the PPPM method. (c) The short-range energy contribution $E_{sr}$ is obtained through inference using the DP model, with the associated forces computed via backpropagation. (d) The displacements of the WCs, denoted as $\boldsymbol \Delta_n$, are predicted by the DW model, and their gradients $-\partial \boldsymbol \Delta_n / \partial \boldsymbol R_i$ are calculated along all three spatial dimensions.
  • Figure 2: The architecture of the A64FX processor. The processor is organized into four Core Memory Groups (CMGs), coupled with a TofuD controller responsible for inter-node communication. The TofuD controller integrates six TNIs and ten network ports, enabling direct connections to ten neighboring nodes in the 6D torus/mesh topology. Fugaku supports hardware offloading for reduction operations through the BGs embedded within the TNIs, which can be flexibly configured into reduction chains and enable low-latency data aggregation.
  • Figure 3: The uTofu-FFT computation process in the X-dimension. (a) Each MPI rank holds only a subset of the real-space grid. (b) Each MPI rank independently performs partial DFT computations on its assigned data segment. (c) A reduction operation across MPI ranks aggregates the partial results and reconstructs the corresponding subset of the K-space data.
  • Figure 4: Hardware-offloaded reduction process for FFT. (a) For a given dimension, nodes are organized into n-rings, with each node serving as the master in one ring, responsible for initiating communication and receiving the final reduction result. (b) The reduction communication starts from the node's Start/End BGs. The data is aggregated along the pre-configured reduction chain and then returned to the master node. (c) Data quantization method. The original floating-point data is scaled up by a factor of $10^7$ and converted to int32. Each pair of int32 values is then packed into a single uint64 for reduction communication. The reduction results are decoded and scaled down to retrieve the final result.
  • Figure 5: The long-range and short-range force overlap strategy. One core in Rank 3 is dedicated to PPPM computations, while the others handle the DW and DP model calculations.
  • ...and 5 more figures
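
The Figure 3 caption describes each rank computing a partial DFT on its own real-space segment, followed by a reduction that sums the partial results. The following is a minimal single-process sketch of that idea in one dimension, not the paper's implementation: the function names `partial_dft` and `reduced_dft` are hypothetical, the "ranks" are simulated by splitting an array, and a plain sum stands in for the cross-rank reduction.

```python
import numpy as np

def partial_dft(segment, offset, n_total):
    """Contribution of one rank's real-space segment to all n_total K-space points.

    `offset` is the global index of the segment's first grid point (assumed layout).
    """
    k = np.arange(n_total)[:, None]                  # output frequencies
    x = offset + np.arange(len(segment))[None, :]    # global indices owned by this rank
    twiddle = np.exp(-2j * np.pi * k * x / n_total)  # DFT factors for those indices
    return twiddle @ segment

def reduced_dft(grid, n_ranks):
    """Split the grid across simulated ranks, then sum their partial DFTs."""
    n_total = len(grid)
    segments = np.array_split(grid, n_ranks)
    offsets = np.cumsum([0] + [len(s) for s in segments[:-1]])
    partials = [partial_dft(s, o, n_total) for s, o in zip(segments, offsets)]
    # The element-wise sum plays the role of the reduction across ranks.
    return np.sum(partials, axis=0)

rng = np.random.default_rng(0)
grid = rng.standard_normal(16)
k_space = reduced_dft(grid, n_ranks=4)
print(np.allclose(k_space, np.fft.fft(grid)))
```

Because the DFT is linear in its input, the sum of the per-segment contributions equals the transform of the full grid. In this sketch every simulated rank produces all K-space points, whereas the caption indicates that each rank ends up with only its own subset of the K-space data.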
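
The quantization step in the Figure 4(c) caption (scale by $10^7$, convert to int32, pack two values into one uint64, reduce, decode, scale down) can be sketched in plain Python. This is an illustration under stated assumptions rather than the paper's code: all function names are hypothetical, the reduction is assumed to be a 64-bit wrapping integer sum, and the borrow-aware decoding in `unpack` is one possible way to keep the two packed lanes independent, which the caption does not specify.

```python
SCALE = 10**7                     # scaling factor stated in the Figure 4 caption
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF

def to_int32(value):
    """Scale a float and round it to an integer that must fit in int32."""
    q = round(value * SCALE)
    if not -2**31 <= q < 2**31:   # with SCALE = 1e7 this limits |value| to about 214
        raise OverflowError("scaled value does not fit in int32")
    return q

def as_signed32(u):
    """Interpret the low 32 bits of u as a two's-complement int32."""
    u &= MASK32
    return u - 2**32 if u >= 2**31 else u

def pack(a, b):
    """Pack two floats into one 64-bit word: a in the high lane, b in the low lane."""
    return ((to_int32(a) << 32) + to_int32(b)) & MASK64

def reduce_sum(words):
    """Stand-in for the reduction: 64-bit wrapping sum (an assumption)."""
    total = 0
    for w in words:
        total = (total + w) & MASK64
    return total

def unpack(word):
    """Decode both lanes, removing the low lane's carry/borrow from the high lane."""
    lo = as_signed32(word)
    hi = as_signed32((word - lo) >> 32)
    return hi / SCALE, lo / SCALE

# Three simulated ranks, each contributing a pair of values.
contributions = [(0.5, -0.25), (-1.5, 0.125), (0.25, -0.75)]
total = reduce_sum(pack(a, b) for a, b in contributions)
print(unpack(total))
```

The decoding is only valid while each lane's summed value still fits in int32, so the scaling factor trades precision against the usable value range.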