Near-Optimal Wafer-Scale Reduce

Piotr Luczynski, Lukas Gianinazzi, Patrick Iff, Leighton Wilson, Daniele De Sensi, Torsten Hoefler

TL;DR

The paper addresses efficient Reduce and AllReduce on the Cerebras Wafer-Scale Engine (WSE) with a model-driven design workflow that accounts for the device's multicast-enabled, pipelined on-chip network. It develops a performance model based on spatial computing and uses it to design and auto-generate near-optimal 1D and 2D collective algorithms, including Star, Chain, Tree, Two-Phase, and Auto-Gen reductions. The authors report speedups of up to $3.27\times$ over the vendor implementation, and the model predicts performance with less than $4\%$ error across a wide range of input sizes and configurations. The resulting collectives widen the range of HPC workloads that can benefit from the WSE's high throughput.

Abstract

Efficient Reduce and AllReduce communication collectives are a critical cornerstone of high-performance computing (HPC) applications. We present the first systematic investigation of Reduce and AllReduce on the Cerebras Wafer-Scale Engine (WSE). This architecture has been shown to achieve unprecedented performance both for machine learning workloads and other computational problems like FFT. We introduce a performance model to estimate the execution time of algorithms on the WSE and validate our predictions experimentally for a wide range of input sizes. In addition to existing implementations, we design and implement several new algorithms specifically tailored to the architecture. Moreover, we establish a lower bound for the runtime of a Reduce operation on the WSE. Based on our model, we automatically generate code that achieves near-optimal performance across the whole range of input sizes. Experiments demonstrate that our new Reduce and AllReduce algorithms outperform the current vendor solution by up to 3.27x. Additionally, our model predicts performance with less than 4% error. The proposed communication collectives increase the range of HPC applications that can benefit from the high throughput of the WSE. Our model-driven methodology demonstrates a disciplined approach that can lead the way to further algorithmic advancements on wafer-scale architectures.

Paper Structure

This paper contains 45 sections, 9 theorems, 15 equations, 13 figures, 1 table.

Key Result

lemma 1

$T_{\textsc{Bcast}} = B + P + 2T_R = T_{\textsc{Message}}$
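
The formula states that a broadcast costs the same as a single message, which matches the multicast behaviour described in the figure captions. This excerpt does not define the symbols, so the following minimal sketch assumes that $B$ is the vector length in wavelets, $P$ is the number of PEs along the path, and $T_R$ is the ramp latency in cycles. The function name, parameter names, and the example values (including `ramp_latency=2`) are hypothetical placeholders, not measurements from the paper.

```python
def broadcast_cycles(num_elements: int, num_pes: int, ramp_latency: int) -> int:
    """Evaluate T_Bcast = B + P + 2*T_R from lemma 1.

    Assumed meanings (not defined in this excerpt):
      num_elements  -> B,   vector length in wavelets
      num_pes       -> P,   number of PEs along the path
      ramp_latency  -> T_R, cycles to traverse one ramp
    """
    if num_elements < 0 or num_pes < 1 or ramp_latency < 0:
        raise ValueError("expected B >= 0, P >= 1, T_R >= 0")
    return num_elements + num_pes + 2 * ramp_latency


if __name__ == "__main__":
    # Illustrative values only: a row of 64 PEs and a placeholder ramp latency.
    for b in (1, 256, 4096):
        print(b, broadcast_cycles(b, num_pes=64, ramp_latency=2))
```

Under these assumptions the cost is additive in $B$ and $P$ rather than multiplicative: for large vectors the $B$ term dominates, and for small vectors the $P$ term dominates.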

Figures (13)

  • Figure 1: Optimality ratios of 1D Reduce algorithms, where 1.0 is optimal. $^{(\dag)}$our contribution.
  • Figure 2: PE 0 sends a wavelet to the neighbouring PE 1 on the blue color. The router connected to PE 1 forwards the wavelet to the right and sends it up the ramp towards PE 1. This demonstrates the multicasting capability of the network.
  • Figure 3: Synchronization on the WSE occurs through routing configurations. In cycle $t$, router 1 is configured to forward the blue wavelets it receives from PE 1 towards PE 0. As a result, the red wavelets from PE 3 stall at router 2. At cycle $t'$, the last element of the vector from PE 1 arrives at router 1. This triggers a change in routing configuration, so that in cycle $t'+1$ the red wavelets are propagated towards PE 0.
  • Figure 4: Broadcast for 3 consecutive cycles. This example showcases both pipelining and multicasting.
  • Figure 5: Routing configurations for 1D Reduce schemes. Each row shows a configuration. When a PE has sent all its data, it switches to the next configuration. Observe that every path is set up to process a vector of elements in a pipeline. However, if a router is not ready yet to forward data because its PE is still in a previous configuration, this will stall the preceding PE. In this way, the operation is loosely synchronized between configurations.
  • ...and 8 more figures

Theorems & Definitions (9)

  • lemma 1
  • lemma 2
  • lemma 3
  • lemma 4
  • lemma 5
  • lemma 6
  • lemma 7
  • lemma 8
  • lemma 9