Scaling Molecular Dynamics with ab initio Accuracy to 149 Nanoseconds per Day

Jianxiong Li, Boyang Li, Zhuoqiang Guo, Mingzhen Li, Enji Li, Lijun Liu, Guojun Yuan, Zhan Wang, Guangming Tan, Weile Jia

TL;DR

This work targets the bottleneck of running ab initio-accurate molecular dynamics over long timescales by enhancing DeePMD-kit on the Fugaku supercomputer. The authors introduce a node-based parallelization approach that reduces inter-node communication, complemented by computation and memory optimizations (TensorFlow removal, SVE-GEMM, mixed precision, RDMA memory pools, and threadpool execution) and an intra-node load-balancing strategy. Together, these advances yield a 31.7x speedup, reaching up to 149 ns/day for copper and 68.5 ns/day for water on 12,000 nodes, with sustained strong scaling and practical load balance. The results demonstrate the feasibility of millisecond-scale ab initio MD within a week, with broad implications for NNMD and domain-decomposition workloads across HPC systems.

Abstract

Physical phenomena such as chemical reactions, bond breaking, and phase transitions require molecular dynamics (MD) simulation with ab initio accuracy over timescales ranging from microseconds to milliseconds. However, previous state-of-the-art neural-network-based MD packages such as DeePMD-kit can only reach 4.7 nanoseconds per day on the Fugaku supercomputer. In this paper, we present a novel node-based parallelization scheme that reduces communication by 81%, then optimize the computationally intensive kernels with SVE-GEMM and mixed precision. Finally, we implement intra-node load balancing to further improve scalability. Numerical results on the Fugaku supercomputer show that our work improves the time-to-solution of DeePMD-kit by a factor of 31.7x, reaching 149 nanoseconds per day on 12,000 computing nodes. This work has opened the door for millisecond simulation with ab initio accuracy within one week for the first time.
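As a quick consistency check on the throughput figures quoted in the abstract, the following minimal sketch recomputes the speedup from the two rates. It assumes that the 4.7 ns/day baseline and the 149 ns/day result refer to comparable runs; the function name `speedup` is hypothetical and not part of DeePMD-kit.

```python
def speedup(new_ns_per_day: float, old_ns_per_day: float) -> float:
    """Ratio of two simulation throughputs given in nanoseconds per day."""
    return new_ns_per_day / old_ns_per_day


# Figures quoted in the abstract (assumed to describe comparable runs).
baseline_ns_per_day = 4.7
optimized_ns_per_day = 149.0

ratio = speedup(optimized_ns_per_day, baseline_ns_per_day)
print(f"{ratio:.1f}x")  # prints "31.7x"
```

The ratio of the two quoted rates matches the 31.7x factor stated in the abstract.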

Paper Structure

This paper contains 24 sections, 2 equations, 11 figures, 3 tables.

Figures (11)

  • Figure 1: The sub-box and workflow of the DeePMD model. (a) Each MPI rank holds a sub-box, and the corresponding neighbor list within cutoff radius $r_c$ is built within the LAMMPS framework. Note that ghost regions are communicated between adjacent MPI ranks. (b) The execution of the DeePMD model. Note that the atomic energy $E_i$ is computed via forward propagation, and the atomic force $F_i$ is calculated through backward propagation.
  • Figure 2: (a) The A64FX CPU has four NUMA domains and one TofuD controller, which are all connected via a high-speed network-on-chip (NoC). Each NUMA domain has 13 cores, one for OS and IO and the other 12 cores for computation. Each TofuD controller is equipped with 10 ports connecting with other CPUs, as well as 6 RDMA engines (TNI) capable of simultaneously sending and receiving six packets in parallel. (b) 6D torus topology of the TofuD network. 12 nodes form a cell and are connected with each other through a 3D-Torus topology. Besides, each node also connects with corresponding nodes in neighboring cells. Considering that 6D torus topology can be transformed into a logical 3D torus topology, applications that use domain decomposition (such as LAMMPS) can directly map onto the system.
  • Figure 3: MPI-based and node-based parallelization schemes. (a) MPI-based parallelization scheme. Ranks 4 and 5 in node 1 have to send their atoms to ranks 0-3 in node 0, while ranks 6 and 7 have to send atoms to ranks 2-3. (b) Node-based parallelization scheme. The leader of node 1 gathers all atoms within the node and only has to send one message to node 0.
  • Figure 4: The workflow of communication in the node-based parallelization scheme. The shared memory is accessible to MPI ranks within the same node via libnuma. Additionally, the RDMA memory is registered by libutofu for RDMA communication and is located in the shared memory. Crucial atomic structures, including position and type information, are stored in the shared memory.
  • Figure 5: Atom organization of the two parallelization schemes in Rank 2. (a) The original parallelization scheme. $lcl\_Rx$ represents the ghost atoms from neighboring MPI ranks within the same node, while $rmt\_Rx$ refers to the ghost atoms from neighboring MPI ranks on other nodes. $nlocal$ and $nghost$ are the number of local atoms and ghost atoms, respectively. (b) The node-based parallelization scheme. $node\_nlocal$ is the local atom number within the node, while $node\_nghost$ is the ghost atom number associated with the node-box. $nodex$ represents different ghost atom groups from other nodes.
  • ...and 6 more figures
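The message-count difference described in the Figure 3 caption can be illustrated with a short sketch. It only counts inter-node messages for the one-directional example in the caption; it is not the authors' implementation and does not reproduce the 81% communication reduction reported in the abstract. The rank-to-node mapping (four ranks per node) and all names (`RANKS_PER_NODE`, `node_of`, `count_rank_messages`, `count_node_messages`) are assumptions made for illustration.

```python
RANKS_PER_NODE = 4  # assumption: ranks 0-3 live on node 0, ranks 4-7 on node 1


def node_of(rank: int) -> int:
    """Map an MPI rank to its node under the assumed block layout."""
    return rank // RANKS_PER_NODE


# Sends from node 1 to node 0, as described in the Figure 3(a) caption.
mpi_sends = {
    4: [0, 1, 2, 3],
    5: [0, 1, 2, 3],
    6: [2, 3],
    7: [2, 3],
}


def count_rank_messages(sends: dict[int, list[int]]) -> int:
    """MPI-based scheme: every rank sends its own message to each remote rank."""
    return sum(
        1
        for src, dsts in sends.items()
        for dst in dsts
        if node_of(src) != node_of(dst)
    )


def count_node_messages(sends: dict[int, list[int]]) -> int:
    """Node-based scheme: a node leader sends one message per neighboring node."""
    node_pairs = {
        (node_of(src), node_of(dst))
        for src, dsts in sends.items()
        for dst in dsts
        if node_of(src) != node_of(dst)
    }
    return len(node_pairs)


mpi_based = count_rank_messages(mpi_sends)
node_based = count_node_messages(mpi_sends)
print(mpi_based, node_based)  # prints "12 1"
```

In this toy setting the twelve rank-to-rank messages collapse into a single node-to-node message, which is the effect the node-based scheme in Figure 3(b) is designed to achieve.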