Beyond Exascale: Dataflow Domain Translation on a Cerebras Cluster

Tomas Oppelstrup; Nicholas Giamblanco; Delyan Z. Kalchev; Ilya Sharapov; Mark Taylor; Dirk Van Essendelft; Sivasankaran Rajamanickam; Michael James

TL;DR

The paper introduces Domain Translation, a latency-hiding domain mapping algorithm, and demonstrates it on a cluster of 64 Cerebras CS-3 (Wafer-Scale Engine) systems. By tilting the space-time calculation and enforcing predominantly unidirectional data flow, the method hides inter-node latency and keeps time stepping compute-bound for large-scale stencil workloads. For 5-point and 9-point heat-equation stencils and for the shallow-water equations used in planetary-scale tsunami modeling, the approach achieves perfect weak scaling at up to 88% of peak performance, delivering $112\ \text{PFLOP/s}$ in a power-unconstrained environment and $57\ \text{GFLOP/J}$ in a power-limited environment. The work points to a viable path toward PDE solvers with both high simulation rate and high utilization on dataflow, wafer-scale hardware, with implications for weather, Earth-system modeling, and real-time digital twins.

Abstract

Simulation of physical systems is essential across scientific and engineering domains. Commonly used domain decomposition methods are unable to deliver both a high simulation rate and high utilization in networked computing environments. In particular, Exascale systems deliver only a small fraction of their peak performance for these workloads. This paper introduces the novel Domain Translation algorithm, designed to overcome these limitations. On a cluster of 64 Cerebras CS-3 systems, we use this method to demonstrate unprecedented cluster performance across a range of metrics: we show simulations running in excess of 1.6 million time steps per second, and we demonstrate perfect weak scaling at 88% of peak performance. At this cluster scale, our implementation provides 112 PFLOP/s in a power-unconstrained environment and 57 GFLOP/J in a power-limited environment. We illustrate the method by applying the shallow-water equations to model a tsunami following an asteroid impact at 460 m resolution on a planetary scale.
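
For readers unfamiliar with the workload class, the following is a minimal sketch of one explicit time step of a 2D heat equation on a 5-point stencil, the simplest of the stencils named in the TL;DR. It is not the Domain Translation algorithm and not the authors' implementation; the function name `heat_step_5pt`, the parameter `alpha`, and the fixed-boundary treatment are assumptions chosen for illustration. It shows only that each cell update reads its four nearest neighbors, which is why every time step on a decomposed domain needs fresh data from adjacent subdomains.

```python
def heat_step_5pt(u, alpha):
    """Advance a 2D grid by one explicit heat-equation step (5-point stencil).

    u     : list of rows of floats; the outermost ring of cells is held
            fixed (an assumed Dirichlet boundary, for illustration only).
    alpha : assumed dimensionless coefficient kappa * dt / dx**2;
            the explicit scheme is stable for alpha <= 0.25.
    """
    ny, nx = len(u), len(u[0])
    new = [row[:] for row in u]  # copy so every update reads old values
    for j in range(1, ny - 1):
        for i in range(1, nx - 1):
            # Discrete Laplacian: four nearest neighbors minus 4x the center.
            lap = (u[j - 1][i] + u[j + 1][i]
                   + u[j][i - 1] + u[j][i + 1]
                   - 4.0 * u[j][i])
            new[j][i] = u[j][i] + alpha * lap
    return new


# Usage: a unit of heat at the center of a 5x5 grid spreads to its neighbors.
grid = [[0.0] * 5 for _ in range(5)]
grid[2][2] = 1.0
grid = heat_step_5pt(grid, 0.25)
```

Because the update at a subdomain edge depends on cells owned by a neighboring node, a conventional decomposition must exchange boundary values every step; this is the inter-node latency that the abstract says Domain Translation is designed to hide.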
