Breaking the Molecular Dynamics Timescale Barrier Using a Wafer-Scale System
Kylee Santos, Stan Moore, Tomas Oppelstrup, Amirali Sharifian, Ilya Sharapov, Aidan Thompson, Delyan Z Kalchev, Danny Perez, Robert Schreiber, Scott Pakin, Edgar A Leon, James H Laros, Michael James, Sivasankaran Rajamanickam
TL;DR
This work demonstrates that a wafer-scale dataflow architecture can break the molecular dynamics (MD) timescale barrier. By mapping one atom per core on the Cerebras WSE, the authors achieve speedups of up to 179x in timesteps per second over Frontier and run MD at hundreds of thousands of timesteps per second for systems of up to ~800k atoms. Innovations such as locality-preserving atom mapping, systolic marching multicast for neighborhood exchange, efficient neighbor lists, atom swapping, and careful handling of periodic boundaries yield near-ideal strong and weak scaling on a monolithic, low-latency fabric. A detailed performance model, extensive measurements, and grain-boundary case studies show dramatic speedups, substantial energy-efficiency gains, and the potential to reach 100 microseconds to milliseconds of simulated time for materials systems. The results point to a transformative path for MD and materials science: direct exploration of slow processes and complex microstructures that were previously out of reach on conventional HPC, with implications for the design of future wafer-scale HPC systems.
Abstract
Molecular dynamics (MD) simulations have transformed our understanding of the nanoscale, driving breakthroughs in materials science, computational chemistry, and several other fields, including biophysics and drug design. Even on exascale supercomputers, however, runtimes are excessive for systems and timescales of scientific interest. Here, we demonstrate strong scaling of MD simulations on the Cerebras Wafer-Scale Engine. By dedicating a processor core to each simulated atom, we demonstrate a 179-fold improvement in timesteps per second versus the GPU-based Frontier exascale platform, along with a large improvement in timesteps per unit energy. Reducing every year of runtime to two days unlocks currently inaccessible timescales of slow microstructure transformation processes that are critical for understanding material behavior and function. Our dataflow algorithm runs Embedded Atom Method (EAM) simulations at rates over 270,000 timesteps per second for problems with up to 800k atoms. This demonstrated performance is unprecedented for general-purpose processing cores.
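The headline numbers in the abstract can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the 1 fs timestep is an assumption chosen for the example, since the abstract does not state the integration timestep.

```python
# Back-of-the-envelope check of the abstract's figures.
# ASSUMPTION: the 1 fs timestep is illustrative; the abstract does not
# state the timestep used. Function names are hypothetical.

SECONDS_PER_DAY = 86_400
DAYS_PER_YEAR = 365
FS_PER_MICROSECOND = 1e9  # 1 microsecond = 1e9 femtoseconds


def runtime_days_after_speedup(baseline_days: float, speedup: float) -> float:
    """Wall-clock days needed once a baseline run is accelerated by `speedup`."""
    return baseline_days / speedup


def simulated_microseconds_per_day(steps_per_second: float, timestep_fs: float) -> float:
    """Simulated time (in microseconds) covered by one wall-clock day."""
    fs_per_day = steps_per_second * timestep_fs * SECONDS_PER_DAY
    return fs_per_day / FS_PER_MICROSECOND


# A year of baseline runtime at the reported 179-fold speedup.
days = runtime_days_after_speedup(DAYS_PER_YEAR, 179)
print(f"One year of runtime becomes {days:.2f} days")

# Simulated time per day at the reported 270,000 timesteps per second,
# under the assumed 1 fs timestep.
us_per_day = simulated_microseconds_per_day(270_000, timestep_fs=1.0)
print(f"Simulated time per wall-clock day: {us_per_day:.1f} microseconds")
```

Under these assumptions, one year of runtime shrinks to about 2.04 days, consistent with the "two days" figure, and a 1 fs timestep would cover roughly 23 microseconds of simulated time per wall-clock day. A larger timestep would scale that figure proportionally.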
