Breaking the Molecular Dynamics Timescale Barrier Using a Wafer-Scale System

Kylee Santos, Stan Moore, Tomas Oppelstrup, Amirali Sharifian, Ilya Sharapov, Aidan Thompson, Delyan Z Kalchev, Danny Perez, Robert Schreiber, Scott Pakin, Edgar A Leon, James H Laros, Michael James, Sivasankaran Rajamanickam

TL;DR

This work demonstrates that a wafer-scale dataflow architecture can break the MD timescale barrier by mapping one atom per core on the Cerebras WSE, achieving a 179-fold speedup in timesteps per second over Frontier and running more than 270,000 timesteps per second for systems of up to 800k atoms. Through innovations such as locality-preserving atom mapping, systolic marching multicast for neighborhood exchange, efficient neighbor lists, atom swapping, and careful handling of periodic boundaries, the authors attain near-ideal strong and weak scaling on a monolithic, low-latency fabric. A detailed performance model, extensive measurements, and grain-boundary case studies show not only dramatic speedups but also substantial energy efficiency and the potential to reach 100 microseconds to milliseconds of simulated time for materials systems. The results point to a transformative path for MD and materials science, enabling direct exploration of slow processes and complex microstructures that were previously out of reach on conventional HPC, with implications for future wafer-scale HPC designs.

Abstract

Molecular dynamics (MD) simulations have transformed our understanding of the nanoscale, driving breakthroughs in materials science, computational chemistry, and several other fields, including biophysics and drug design. Even on exascale supercomputers, however, runtimes are excessive for systems and timescales of scientific interest. Here, we demonstrate strong scaling of MD simulations on the Cerebras Wafer-Scale Engine. By dedicating a processor core for each simulated atom, we demonstrate a 179-fold improvement in timesteps per second versus the Frontier GPU-based Exascale platform, along with a large improvement in timesteps per unit energy. Reducing every year of runtime to two days unlocks currently inaccessible timescales of slow microstructure transformation processes that are critical for understanding material behavior and function. Our dataflow algorithm runs Embedded Atom Method (EAM) simulations at rates over 270,000 timesteps per second for problems with up to 800k atoms. This demonstrated performance is unprecedented for general-purpose processing cores.
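
The headline numbers in the abstract can be checked with a little arithmetic. The sketch below is a back-of-the-envelope calculation, not something taken from the paper: the speedup and the timestep rate come from the abstract, the 30-day budget matches the one quoted in the Figure 1 caption, and the 2 fs timestep is an assumption chosen only for illustration (this excerpt does not state the timestep used). The function names are hypothetical.

```python
# Back-of-the-envelope check of the figures quoted in the abstract.
SPEEDUP = 179               # from the abstract: WSE vs. Frontier, timesteps per second
STEPS_PER_SECOND = 270_000  # from the abstract: EAM rate on the WSE
WALL_CLOCK_DAYS = 30        # wall-clock budget quoted in the Figure 1 caption
TIMESTEP_FS = 2.0           # ASSUMPTION: MD timestep in femtoseconds (not given in this excerpt)

SECONDS_PER_DAY = 86_400


def days_after_speedup(days: float, speedup: float) -> float:
    """Wall-clock days needed once a run is accelerated by `speedup`."""
    return days / speedup


def simulated_time_us(steps_per_second: float, wall_days: float, timestep_fs: float) -> float:
    """Simulated physical time in microseconds for a given wall-clock budget."""
    total_steps = steps_per_second * wall_days * SECONDS_PER_DAY
    return total_steps * timestep_fs * 1e-9  # 1 fs = 1e-9 microseconds


if __name__ == "__main__":
    print(f"One year of runtime becomes {days_after_speedup(365, SPEEDUP):.2f} days")
    reach = simulated_time_us(STEPS_PER_SECOND, WALL_CLOCK_DAYS, TIMESTEP_FS)
    print(f"{WALL_CLOCK_DAYS} days at {STEPS_PER_SECOND} steps/s covers about {reach:.0f} microseconds")
```

With these inputs, a year of runtime shrinks to roughly 2.04 days, consistent with the "two days" in the abstract. Under the assumed 2 fs timestep, 30 days of wall-clock time corresponds to about 1.4 ms of simulated time; a different timestep scales that figure proportionally.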

Paper Structure (26 sections, 6 equations, 10 figures, 6 tables)

Figures (10)

  • Figure 1: Comparison of maximum MD timescale achievable using Cerebras Wafer-Scale Engine (WSE, green) and Exascale GPU hardware (GPU, gray). The boxes represent typical achievable ranges of length and time using different materials simulation approaches: quantum electronic methods (QM, left box), molecular dynamics (MD, middle box) and continuum mechanics (CM, right box). Green and gray stars reflect measured performance for 800,000 Ta atoms (see Fig. \ref{fig:perfpanel}), assuming 30 days of wall-clock time on WSE and GPU hardware, respectively. The nearly 180-fold increase in maximum achievable timescale for MD using WSE is transformative for a broad range of applications in materials science, chemistry, and physics.
  • Figure 2: Two views of a grain boundary in tungsten (W). Atoms in the grain boundary are shown in white. The other colors represent different crystal orientations. Upper panel: the difference in crystal lattice orientation can be seen above and below the grain boundary. Lower panel: Although more complex and less clearly defined, there is also structure in the grain boundary.
  • Figure 3: Example of the candidate exchange for a $5\times 5$ neighborhood, i.e. $b=2$. (a) Overlapping neighborhoods of two distinct atoms. All atoms within the red and blue atoms' interaction thresholds are contained within the red and blue outlined regions, respectively. (b) Horizontal stage of neighborhood multicast. (c) Vertical stage of neighborhood multicast. (d) First positive-horizontal-multicast transmission. Red head tiles multicast atom data to the two tiles to their right. (e) Second multicast transmission. The roles of tiles in the multicast domain have shifted one tile to the right. (f) Last multicast transmission. After this, all tiles in the fabric have transmitted their atom two hops to the right. Leftward transmission (not depicted) occurs concurrently.
  • Figure 4: (a) Systolic routing pipeline diagram for the marching multicast, with $b=3$. (b) Router state machine for marching multicast. (c) Tungsten code for neighborhood communication's horizontal stage. The vertical stage differs only in its transfer size.
  • Figure 5: Periodic space to a line segment.
  • ...and 5 more figures
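
The captions for Figures 3 and 4 describe a neighborhood exchange in two stages, horizontal then vertical, that delivers a $(2b+1)\times(2b+1)$ block of candidate atoms to every tile. The Python sketch below models only the data reachability of such a two-stage exchange, assuming one atom per tile. It is an illustration under stated assumptions, not the paper's implementation: it ignores the systolic timing and router state machine of the marching multicast, it clips neighborhoods at the fabric edge instead of handling periodic boundaries, and the function name `two_stage_exchange` is hypothetical.

```python
from typing import Dict, Set, Tuple

Tile = Tuple[int, int]  # (row, col) position of a tile on the fabric


def two_stage_exchange(rows: int, cols: int, b: int) -> Dict[Tile, Set[Tile]]:
    """For each tile, return the set of tiles whose atom data it holds
    after a horizontal stage followed by a vertical stage.

    ASSUMPTIONS: one atom per tile, neighborhoods clipped at the fabric
    edge (no periodic wrap-around), and no modeling of transfer timing.
    """
    # Horizontal stage: every tile gathers the atoms of tiles within
    # b hops to its left and right in the same row.
    row_data: Dict[Tile, Set[Tile]] = {}
    for r in range(rows):
        for c in range(cols):
            lo, hi = max(0, c - b), min(cols - 1, c + b)
            row_data[(r, c)] = {(r, cc) for cc in range(lo, hi + 1)}

    # Vertical stage: every tile gathers the row segments collected by
    # tiles within b hops above and below it in the same column, so each
    # vertical transfer carries a whole row segment, not a single atom.
    result: Dict[Tile, Set[Tile]] = {}
    for r in range(rows):
        for c in range(cols):
            lo, hi = max(0, r - b), min(rows - 1, r + b)
            gathered: Set[Tile] = set()
            for rr in range(lo, hi + 1):
                gathered |= row_data[(rr, c)]
            result[(r, c)] = gathered
    return result


if __name__ == "__main__":
    neighborhoods = two_stage_exchange(rows=9, cols=9, b=2)
    # Number of tiles (including itself) whose atoms the center tile now holds.
    print(len(neighborhoods[(4, 4)]))
```

For an interior tile with $b=2$, the result contains 25 tiles (the tile itself plus 24 neighbors), matching the $5\times 5$ neighborhood in the Figure 3 caption. In this model the vertical stage forwards the row segments gathered during the horizontal stage, which is one plausible reading of the Figure 4 remark that the vertical stage differs only in its transfer size.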