
Matrix-Free Finite Volume Kernels on a Dataflow Architecture

Ryuichi Sai, Francois P. Hamon, John Mellor-Crummey, Mauricio Araya-Polo

TL;DR

The paper addresses memory-bandwidth bottlenecks in implicit finite-volume (FV) simulations for geological carbon capture and storage (CCS) by proposing a matrix-free FV solver implemented on a wafer-scale dataflow architecture. It develops a mapping of the data onto a 2D fabric, a whole-fabric all-reduce, a 14-state conjugate gradient (CG) solver, and memory- and communication-oriented optimizations. The solver applies the Jacobian to a vector cell by cell, $(\mathbf{J}\mathbf{x})_K = \sum_{L \in \text{adj}(K)} \Upsilon_{KL} \lambda_{KL} (x_L - x_K)$, where the sum runs over the cells $L$ adjacent to cell $K$, without assembling the full Jacobian. On a Cerebras CS-2, the implementation attains up to 1.217 PFLOPS and up to 427.82x speedup over GPU baselines, with nearly perfect weak scaling of the matrix-free kernel and favorable scalability of the CG solver as the fabric size grows. These results demonstrate the viability of dataflow architectures for accelerating large-scale CCS subsurface flow simulations, potentially enabling faster design and simulation cycles.
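The matrix-free product in the TL;DR can be illustrated with a small NumPy sketch on a structured 3D grid, where each interior cell has six face neighbors (the 7-point stencil). This is not the paper's dataflow implementation. The function name `apply_jacobian` and the arrays `coef_x`, `coef_y`, `coef_z` are hypothetical, and the product $\Upsilon_{KL} \lambda_{KL}$ is assumed to be folded into a single coefficient stored per face.

```python
import numpy as np

def apply_jacobian(x, coef_x, coef_y, coef_z):
    """Return y with y_K = sum over face neighbors L of c_KL * (x_L - x_K).

    x      : cell values, shape (nx, ny, nz)
    coef_x : coefficient of the face between (i, j, k) and (i+1, j, k),
             shape (nx-1, ny, nz); assumed to hold Upsilon_KL * lambda_KL
    coef_y : same for Y faces, shape (nx, ny-1, nz)
    coef_z : same for Z faces, shape (nx, ny, nz-1)
    """
    y = np.zeros_like(x, dtype=float)

    # X faces: each face contributes +flux to its lower cell, -flux to its upper cell.
    flux = coef_x * (x[1:, :, :] - x[:-1, :, :])
    y[:-1, :, :] += flux
    y[1:, :, :] -= flux

    # Y faces.
    flux = coef_y * (x[:, 1:, :] - x[:, :-1, :])
    y[:, :-1, :] += flux
    y[:, 1:, :] -= flux

    # Z faces.
    flux = coef_z * (x[:, :, 1:] - x[:, :, :-1])
    y[:, :, :-1] += flux
    y[:, :, 1:] -= flux

    return y

# Usage on a tiny grid with random (illustrative) face coefficients.
rng = np.random.default_rng(0)
nx, ny, nz = 4, 3, 2
x = rng.random((nx, ny, nz))
cx = rng.random((nx - 1, ny, nz))
cy = rng.random((nx, ny - 1, nz))
cz = rng.random((nx, ny, nz - 1))
y = apply_jacobian(x, cx, cy, cz)
```

Only the face coefficients and the vector are stored, so no sparse matrix is ever built. Each face is visited once and updates the two cells that share it.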

Abstract

Fast and accurate numerical simulations are crucial for designing large-scale geological carbon storage projects that ensure safe long-term CO2 containment as a climate change mitigation strategy. These simulations involve solving numerous large and complex linear systems arising from the implicit Finite Volume (FV) discretization of the PDEs governing subsurface fluid flow. With highly detailed geomodels, solving these linear systems is expensive in both computation and memory, and accounts for the majority of the simulation time. Modern memory hierarchies are insufficient to meet the latency and bandwidth needs of large-scale numerical simulations. Therefore, exploring algorithms that can leverage alternative and balanced paradigms, such as dataflow and in-memory computing, is crucial. This work introduces a matrix-free algorithm to solve FV-based linear systems using a dataflow architecture to significantly reduce memory latency and bandwidth bottlenecks. Our implementation achieves a speedup of two orders of magnitude over a GPGPU-based reference implementation, and up to 1.2 PFlops on a single dataflow device.
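A matrix-free approach works for iterative solvers such as conjugate gradient because CG needs only the action of the operator on a vector, never its entries. The sketch below is a textbook CG in NumPy, not the paper's implementation. The names `conjugate_gradient` and `apply_A` are hypothetical, and the demo operator is a simple 1D Dirichlet Laplacian chosen only because it is symmetric positive definite.

```python
import numpy as np

def conjugate_gradient(apply_A, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for a symmetric positive definite A given only v -> A v."""
    x = np.zeros_like(b, dtype=float)
    r = b - apply_A(x)              # initial residual
    p = r.copy()                    # initial search direction
    rs_old = float(np.vdot(r, r))   # inner product (a global reduction when distributed)
    b_norm = float(np.linalg.norm(b))
    iterations = 0
    while iterations < max_iter and np.sqrt(rs_old) > tol * b_norm:
        Ap = apply_A(p)             # the only place the operator is used
        alpha = rs_old / float(np.vdot(p, Ap))
        x += alpha * p
        r -= alpha * Ap
        rs_new = float(np.vdot(r, r))
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
        iterations += 1
    return x, iterations

def apply_laplacian_1d(v):
    """Matrix-free 1D operator: (A v)_i = 2 v_i - v_{i-1} - v_{i+1}, zero outside."""
    y = 2.0 * v
    y[1:] -= v[:-1]
    y[:-1] -= v[1:]
    return y

# Usage: solve a small system without ever forming the matrix.
b = np.ones(50)
x, iterations = conjugate_gradient(apply_laplacian_1d, b)
residual = np.linalg.norm(apply_laplacian_1d(x) - b)
```

Each iteration costs one operator application plus two inner products. On a distributed fabric those inner products require a reduction across all processing elements, which is presumably the role of the whole-fabric all-reduce mentioned in the TL;DR.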

Paper Structure

This paper contains 24 sections, 6 equations, 6 figures, 5 tables, 2 algorithms.

Figures (6)

  • Figure 1: A 7-point stencil used in the flux computation.
  • Figure 2: An overview of the Wafer Scale Engine (WSE). The WSE (right) occupies an entire wafer and is a 2D array of dies. Each die is itself a grid of tiles (middle), each of which contains a processing element (left). Each processing element (PE) has its own router, which connects to the PE itself and to the routers of its four cardinal neighbors. Each PE has 48 KB of local memory, and in total $\approx$ 850,000 PEs are available for computing. Figure credit: MauricioSC22
  • Figure 3: Three-dimensional problem mapping to a two-dimensional fabric of processing elements using a cell-based approach.
  • Figure 4: Eastward localized broadcast operation used to exchange cell values along the X dimension.
  • Figure 5: Pressure propagation from the source (top left of the left plot) to the producer (bottom right of the same plot). The right plot shows the source in detail.
  • ...and 1 more figure