Record Acceleration of the Two-Dimensional Ising Model Using High-Performance Wafer Scale Engine

Dirk Van Essendelft, Hayl Almolyki, Wei Shi, Terry Jordan, Mei-Yu Wang, Wissam A. Saidi

TL;DR

This work accelerates the two-dimensional Ising model on the Cerebras Wafer-Scale Engine (WSE) by tailoring a checkerboard Monte Carlo update to the WSE's 2D grid of processing elements. It combines a domain-folding strategy with a compressed representation that packs 16 spins into each int16 word, spread across eight arrays, to minimize memory traffic and enable near-ideal weak scaling with only nearest-neighbor communication. The approach reaches a peak of 61.8 trillion flip attempts per second for lattices of up to 200 million spins, a speedup of up to 148x over a highly optimized V100 implementation and up to 88x higher productivity than an H100 for multi-simulation workloads. The results highlight the WSE's potential for large-scale scientific computing and materials modeling, where spin-based Monte Carlo simulations can exploit massive parallelism.
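As a point of reference for the checkerboard update mentioned above, the following is a minimal NumPy sketch of a generic checkerboard Metropolis sweep on a periodic lattice. It is not the WSE implementation: the function name `checkerboard_sweep`, the one-spin-per-array-element storage, and the use of `np.roll` for neighbor access are assumptions made for illustration, and both lattice dimensions are assumed to be even so the two colors tile the periodic lattice consistently.

```python
import numpy as np

def checkerboard_sweep(spins, beta, rng, J=1.0):
    """One Metropolis sweep: update one checkerboard color, then the other.

    spins : 2D array of +1/-1 values with even dimensions (modified in place)
    beta  : inverse temperature 1/T (k_B = 1)
    rng   : numpy.random.Generator
    """
    rows, cols = np.indices(spins.shape)
    for color in (0, 1):
        mask = (rows + cols) % 2 == color
        # Sum of the four nearest neighbors with periodic boundaries.
        # All neighbors of one color belong to the other color, so every
        # site of the active color can be updated at the same time.
        nn = (np.roll(spins, 1, axis=0) + np.roll(spins, -1, axis=0)
              + np.roll(spins, 1, axis=1) + np.roll(spins, -1, axis=1))
        delta_e = 2.0 * J * spins * nn          # energy change if flipped
        accept = rng.random(spins.shape) < np.exp(-beta * delta_e)
        spins[mask & accept] *= -1
    return spins

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    lattice = np.ones((16, 16), dtype=np.int8)
    for _ in range(100):
        checkerboard_sweep(lattice, beta=1.0 / 2.0, rng=rng)
    print("mean |m| after 100 sweeps at T = 2.0:", abs(lattice.mean()))
```

Each sweep makes one flip attempt per spin, so a flip-attempt rate such as the one quoted above corresponds to the number of spins times the number of sweeps, divided by wall-clock time.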

Abstract

The versatility and wide-ranging applicability of the Ising model, originally introduced to study phase transitions in magnetic materials, have made it a cornerstone of statistical physics and a valuable tool for evaluating the performance of emerging computer hardware. Here, we present a novel implementation of the two-dimensional Ising model on a Cerebras Wafer-Scale Engine (WSE), a revolutionary processor that is opening new frontiers in computing. In our deployment of the checkerboard algorithm, we optimized the Ising model to take advantage of the unique WSE architecture. Specifically, we employed a compressed bit representation that stores 16 spins in each int16 word, and we distributed the spins efficiently over the processing units, enabling seamless weak scaling and limiting communication to immediately neighboring units. Our implementation can handle up to 754 simulations in parallel, achieving an aggregate of over 61.8 trillion flip attempts per second for Ising models with up to 200 million spins. This represents a gain of up to 148 times over the previously reported single-device performance of a highly optimized implementation on an NVIDIA V100, and up to 88 times in productivity compared to an NVIDIA H100. Our findings highlight the significant potential of the WSE in scientific computing, particularly in the field of materials modeling.
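The compressed representation described above can be illustrated with a short Python sketch that packs 16 spins into a single 16-bit word, least significant bit first. The helper names `pack16`, `unpack16`, and `word_spin_indices` are hypothetical, the encoding of a down spin as bit 0 and an up spin as bit 1 is an assumption, and the word is handled as an unsigned Python integer rather than a signed int16. The stride-2 indexing follows the even/odd split described for the spin arrays, where one word starting at spin 224 covers spins 224 through 254.

```python
def pack16(bits):
    """Pack 16 spin bits (assumed 0 = down, 1 = up) into one 16-bit word.

    bits[0] goes to the least significant bit, bits[15] to the most
    significant bit.
    """
    if len(bits) != 16:
        raise ValueError("expected exactly 16 spin bits")
    word = 0
    for position, bit in enumerate(bits):
        word |= (bit & 1) << position
    return word

def unpack16(word):
    """Inverse of pack16: return the 16 spin bits, least significant first."""
    return [(word >> position) & 1 for position in range(16)]

def word_spin_indices(first_index):
    """Lattice indices held by one word when spins are stored with stride 2
    (every other index, e.g. all even indices of one checkerboard color)."""
    return list(range(first_index, first_index + 32, 2))

if __name__ == "__main__":
    bits = [1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0]
    word = pack16(bits)
    print(f"packed word: 0x{word:04x}")
    print("round trip ok:", unpack16(word) == bits)
    print("indices in the word starting at 224:", word_spin_indices(224))
```

Storing one spin per bit means a single bitwise operation on a word acts on 16 spins at once, which is what reduces memory traffic relative to one spin per integer.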

Paper Structure (8 sections, 2 equations, 5 figures, 1 table, 3 algorithms)

This paper contains 8 sections, 2 equations, 5 figures, 1 table, 3 algorithms.

Figures (5)

  • Figure 1: CS-2 Wafer-Scale Engine. (rightmost) A single Wafer-Scale Engine is a single processor spanning the largest possible square that can be patterned on a 300 mm wafer. Each processing element (PE) is a Turing-complete, independently programmable computer consisting of a controller, an Arithmetic Logic Unit (ALU), 48 KB of Static Random-Access Memory (SRAM), and a router that can communicate with nearest-neighbor PEs. All main memory in the WSE is SRAM and is accessible at L1 cache rates (128 bits of read and 64 bits of write on each cycle), which are matched to the ALU processing rates. (middle) Each processor is a collection of dies arranged in a 2D fashion, which are further subdivided into a grid of PEs. (leftmost) One die hosts thousands of PEs (computational cores, memory, and routers). There is no logical discontinuity between adjacent dies and no additional bandwidth penalty for crossing the die-to-die boundary.
  • Figure 2: Conceptual layout of the $12 \times 96$ 2D Ising model on WSE hardware. (a) The WSE is used to solve multiple instances of the 2D Ising model, where each instance is solved independently on a single column of the WSE. Note that the real WSE-2 architecture contains nearly 1 million PEs distributed over 756 columns. We devote one axis of PEs to one of the spin axes of the 2D Ising model; the second PE axis is devoted to running simulations in parallel. The second spin axis is held in the memory of each PE. (b) The conceptual layout within the WFA with a single fold. Each simulation spans 5 PEs (3 workers, blue, and 2 moats, orange). The expanded and PBC-updated column vectors are spread across the spin axis in order. The moats hold one axis of boundary condition (BC) data. The second axis of BC data is held at the top and bottom of each vector in the workers' memory space. (c) Checkerboard decomposition. (d) 8-array representation on the WSE. For instance, the RFE array refers to spins belonging to the red checkerboard in (a) with even index and forward ordering. BBO refers to spins belonging to the blue checkerboard with odd index and backward ordering. The mapping between the Ising lattice and the WSE PEs is also shown in (b), where $k,i$ refers to the PE $(x,y)$ coordinates. For simplicity and clarity, we show only a few spin indices from 0 to 1152.
  • Figure 3: An illustration of the compacted spin representation around $s_{224}^{RFE}$ for the $12 \times 96$ lattice decomposition. All values displayed are $int16$ integer values that contain 16 bits representing 16 spin states in ascending order from the least to the most significant bit. $s_{224}^{RFE}$ contains 16 bits representing spins between 224 (least significant) and 254 (most significant). The right, left, and top neighbors of all spins in $s_{224}^{RFE}$ are blue spins in $s_{320}^{BFE}$, $s_{128}^{BFE}$, and $s_{225}^{BFO}$, respectively. The bottom neighbors of $s_{224}^{RFE}$ can be derived from $s_{225}^{BFO}$ and $s_{193}^{BFO}$ using logical operators, as shown in Algorithm \ref{alg:pe_getNeighbors}.
  • Figure 4: The absolute average magnetization obtained from Monte Carlo simulations on the WSE for the 2D Ising model at temperatures between $0.385J$ and $4.145J$ for three different sizes of 2D Ising lattices. For comparison, the analytical solution in the thermodynamic limit (an infinitely large spin system) is also shown, along with the critical temperature $T_c/J = 2.269185$ obtained from that analytical solution.
  • Figure 5: (a) The flip rate as a function of the number of spins on the device. The V100 GPU results are those reported in Romero2020CPC, whereas the A100 and H100 benchmarks were obtained in this study using the same code as in Romero2020CPC. The WSE results are obtained using $N_{\rm{fold}}=1$, 3, and 5 folding. (b) Productivity metric, defined as the cumulative device time per iteration to run $N_{\rm{sim}}$ simulations, compared between the H100 and the WSE. The lattice sizes for the WSE measurements are $2048 \times 2048$, $3952 \times 4096$, $7904 \times 8192$, and $11,586 \times 16,384$. The lattice sizes for the GPU measurements are $2048\times 2048$, $4096\times 4096$, $8192 \times 8192$, and $16,384\times 16,384$.
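
The analytical reference curve mentioned in the Figure 4 caption can be reproduced with a few lines of Python. The sketch below uses the standard exact results for the infinite square-lattice Ising model, $T_c = 2J/\ln(1+\sqrt{2})$ and $M(T) = \left[1 - \sinh^{-4}(2J/T)\right]^{1/8}$ for $T < T_c$, with $k_B = 1$. The names `T_C` and `onsager_magnetization` are chosen here for illustration and do not come from the paper.

```python
import math

# Critical temperature of the infinite 2D square-lattice Ising model,
# in units of J (k_B = 1): T_c = 2 / ln(1 + sqrt(2)) ~ 2.269185
T_C = 2.0 / math.log(1.0 + math.sqrt(2.0))

def onsager_magnetization(T, J=1.0):
    """Spontaneous magnetization per spin in the thermodynamic limit.

    Returns 0 at and above the critical temperature, and
    (1 - sinh(2J/T)^-4)^(1/8) below it. T must be positive.
    """
    if T <= 0:
        raise ValueError("temperature must be positive")
    if T >= T_C * J:
        return 0.0
    return (1.0 - math.sinh(2.0 * J / T) ** -4) ** 0.125

if __name__ == "__main__":
    print(f"T_c / J = {T_C:.6f}")
    for T in (0.385, 1.5, 2.0, 2.2, 2.5, 4.145):
        print(f"T = {T:5.3f}  M = {onsager_magnetization(T):.4f}")
```

Finite lattices deviate from this curve near $T_c$, where the simulated $|m|$ decays smoothly instead of dropping to zero.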