Beyond Exascale: Dataflow Domain Translation on a Cerebras Cluster

Tomas Oppelstrup, Nicholas Giamblanco, Delyan Z. Kalchev, Ilya Sharapov, Mark Taylor, Dirk Van Essendelft, Sivasankaran Rajamanickam, Michael James

TL;DR

The paper introduces Domain Translation, a latency-hiding domain mapping algorithm, and demonstrates it on a cluster of 64 Cerebras Wafer-Scale Engine systems. By tilting the space-time calculation and enforcing predominantly unidirectional data flow, the method hides inter-node latency and achieves compute-bound time stepping across large-scale stencil workloads. Across 5-point and 9-point heat equations and the shallow-water equations for planetary-scale tsunami modeling, the approach shows weak scaling at 88% of peak performance, reaching $112\ \text{PFLOP/s}$ in a power-unconstrained environment and $57\ \text{GFLOP/J}$ in a power-limited environment. This work showcases a viable path to exascale-ready PDE solvers on dataflow, wafer-scale hardware, with significant implications for weather, Earth-system modeling, and real-time digital twins.

Abstract

Simulation of physical systems is essential across scientific and engineering domains. Commonly used domain decomposition methods are unable to simultaneously deliver both high simulation rate and high utilization in network computing environments. In particular, Exascale systems deliver only a small fraction of their peak performance for these workloads. This paper introduces the novel Domain Translation algorithm, designed to overcome these limitations. On a cluster of 64 Cerebras CS-3 systems, we use this method to demonstrate unprecedented cluster performance across a range of metrics: we show simulations running in excess of 1.6 million time steps per second; we also demonstrate perfect weak scaling at 88% of peak performance. At this cluster scale, our implementation provides 112 PFLOP/s in a power-unconstrained environment, and 57 GFLOP/J in a power-limited environment. We illustrate the method by applying the shallow-water equations to model a tsunami following an asteroid impact at 460 m resolution on a planetary scale.

Paper Structure

This paper contains 18 sections, 2 equations, 11 figures, 6 tables.

Figures (11)

  • Figure 1: Illustrative example: a three-point stencil in one dimension. (a) On a single node there are no external dependencies and each time step takes one unit of time. (b) Static domain decomposition between nodes imposes an extra latency for communicating across the node boundary (10 units in this example). Grid points adjacent to the boundary incur this latency again at every time step, so the delays add up. (c) If the partition shifts by one unit at each time step, the cross-node latency is applied only once because the dependency across the domain boundary is unidirectional.
  • Figure 2: Domain translation method. Diagram shows grid points (x-axis) by timestep (y-axis). The diagonal lines indicate high-latency subdomain boundaries. The diagram depicts the algorithm's steady-state duty cycle. When the node receives a package (blue) it initiates a computation sweep that uses stored values (yellow) to produce the new grid point state (red). At the end of the sweep it submits its last computed values into the downstream network pipeline. The time used to complete the computation sweep is the same time it takes a network package to advance one "step" through the network pipeline. The end-to-end network latency coincides with $n /2p$ steps. The processor's wall-clock time proceeds in the direction of the "domain translation" arrow. Note that in the processor's time frame, it proceeds through a full network pipeline's worth of computation sweeps prior to receiving data that had just entered the upstream network pipeline.
  • Figure 3: Time step rate can be limited by the network latency, bandwidth, or compute throughput. If the number of grid points per node is sufficiently large, the network effects are fully hidden and the kernels are expected to run at full computational utilization.
  • Figure 4: Switch topology for an arbitrary number of WSEs. Ports with the same index on all wafers in a cluster connect to the same network switch.
  • Figure 5: An example cluster topology showing the original spatial stencil and its Y-axis mirror, which alternate from one wafer to the next across the cluster. Mirroring enables data to enter and exit through the same switch, minimizing communication latency.
  • ...and 6 more figures
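
The three cases in the Figure 1 caption can be reproduced with a toy timing model. The sketch below is not the authors' implementation; it assumes an idealized machine in which every grid point takes one time unit to update once its three inputs have arrived, and any input owned by a different node arrives `latency` units late. The function name `finish_time`, the `owner` callbacks, and the concrete sizes (16 points, 8 steps, boundary at index 12) are all hypothetical choices made for illustration; only the latency of 10 units comes from the caption.

```python
def finish_time(n_points, n_steps, latency, owner):
    """Completion time of the last time step in a toy 1D three-point stencil.

    owner(x, t) returns the id of the node that holds grid point x at
    time step t. Assumed model: one time unit of compute per update, plus
    `latency` units for any input that comes from a different node.
    """
    ready = [0] * n_points  # the initial state is available at time 0
    for t in range(1, n_steps + 1):
        new_ready = []
        for x in range(n_points):
            arrivals = []
            for nx in (x - 1, x, x + 1):
                if 0 <= nx < n_points:  # no neighbor beyond the domain edge
                    crosses = owner(nx, t - 1) != owner(x, t)
                    arrivals.append(ready[nx] + (latency if crosses else 0))
            new_ready.append(max(arrivals) + 1)
        ready = new_ready
    return max(ready)


N_POINTS, N_STEPS, LATENCY, BOUNDARY = 16, 8, 10, 12

# (a) Single node: no cross-node inputs at all.
single = finish_time(N_POINTS, N_STEPS, LATENCY, lambda x, t: 0)

# (b) Static decomposition: the boundary stays at the same grid index, so
# the two points next to it wait on each other at every time step.
static = finish_time(
    N_POINTS, N_STEPS, LATENCY, lambda x, t: 0 if x < BOUNDARY else 1
)

# (c) Shifting partition: the boundary moves one grid point per step, so
# node 0 never needs data from node 1 and the dependency is one-way.
shifting = finish_time(
    N_POINTS, N_STEPS, LATENCY, lambda x, t: 0 if x < BOUNDARY - t else 1
)

print(single, static, shifting)  # prints "8 88 18"
```

Under these assumptions the static partition pays the latency on every step (8 × (1 + 10) = 88 units), while the shifting partition pays it once (8 + 10 = 18 units), which matches the qualitative behavior the caption describes. In this toy model the boundary moves toward lower indices; the direction is arbitrary as long as it makes the cross-boundary dependency one-way.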