Matrix-Free Finite Volume Kernels on a Dataflow Architecture
Ryuichi Sai, Francois P. Hamon, John Mellor-Crummey, Mauricio Araya-Polo
TL;DR
The paper addresses memory-bandwidth bottlenecks in implicit finite-volume (FV) simulations for geological carbon capture and storage (CCS) by proposing a matrix-free FV solver implemented on a wafer-scale dataflow architecture. It develops a data mapping to a 2D fabric, a whole-fabric all-reduce, a 14-state conjugate gradient (CG), and memory- and communication-oriented optimizations. The operator is applied cell by cell as $(\mathbf{J} \mathbf{x})_K = \sum_{L \in \text{adj}(K)} \Upsilon_{KL} \lambda_{KL} (x_L - x_K)$, without assembling the full Jacobian. On a Cerebras CS-2, the implementation attains up to 1.217 PFLOPS and up to 427.82x speedup over GPU baselines, with nearly perfect weak scaling of the matrix-free kernel and favorable scalability of the CG solver as fabric size grows. These results demonstrate the viability of dataflow architectures for accelerating large-scale CCS subsurface flow simulations, potentially enabling faster design and simulation cycles.
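To make the matrix-free formula concrete, here is a minimal NumPy sketch that evaluates $(\mathbf{J}\mathbf{x})_K$ on a small structured 2D grid without ever forming $\mathbf{J}$. It is an illustration only, not the paper's dataflow implementation. It assumes $\Upsilon_{KL}$ can be read as a per-face transmissibility and $\lambda_{KL}$ as a per-face mobility, and the function name `apply_jacobian` and the array layout are hypothetical.

```python
import numpy as np

def apply_jacobian(x, trans_x, trans_y, mob_x, mob_y):
    """Return y with y_K = sum over neighbors L of T_KL * lam_KL * (x_L - x_K).

    x       : (nx, ny) array of cell values
    trans_x : (nx-1, ny) face coefficients between cells (i, j) and (i+1, j)
    trans_y : (nx, ny-1) face coefficients between cells (i, j) and (i, j+1)
    mob_x, mob_y : face mobilities, same shapes as trans_x, trans_y
    (Reading Upsilon as transmissibility and lambda as mobility is an assumption.)
    """
    y = np.zeros_like(x, dtype=float)

    # Each x-face is shared by two cells; compute its term once.
    fx = trans_x * mob_x * (x[1:, :] - x[:-1, :])
    y[:-1, :] += fx   # cell (i, j) sees neighbor (i+1, j): +(x_L - x_K)
    y[1:, :] -= fx    # cell (i+1, j) sees neighbor (i, j): opposite sign

    # Same pattern for y-faces.
    fy = trans_y * mob_y * (x[:, 1:] - x[:, :-1])
    y[:, :-1] += fy
    y[:, 1:] -= fy
    return y

# Small usage example with uniform, made-up coefficients.
nx, ny = 4, 3
x = np.arange(nx * ny, dtype=float).reshape(nx, ny)
y = apply_jacobian(x,
                   np.ones((nx - 1, ny)), np.ones((nx, ny - 1)),
                   np.ones((nx - 1, ny)), np.ones((nx, ny - 1)))
```

Only the cell values and the face coefficients are stored, so memory grows linearly with the number of cells. A constant field gives a zero result, because every difference $x_L - x_K$ vanishes.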
Abstract
Fast and accurate numerical simulations are crucial for designing large-scale geological carbon storage projects, ensuring safe long-term CO2 containment as a climate change mitigation strategy. These simulations involve solving numerous large and complex linear systems arising from the implicit Finite Volume (FV) discretization of the PDEs governing subsurface fluid flow. Combined with highly detailed geomodels, solving these linear systems is expensive in both computation and memory, and accounts for the majority of the simulation time. Modern memory hierarchies are insufficient to meet the latency and bandwidth needs of large-scale numerical simulations. Therefore, exploring algorithms that can leverage alternative and balanced paradigms, such as dataflow and in-memory computing, is crucial. This work introduces a matrix-free algorithm to solve FV-based linear systems using a dataflow architecture to significantly reduce memory latency and bandwidth bottlenecks. Our implementation achieves a two-orders-of-magnitude speedup over a GPGPU-based reference implementation, and up to 1.2 PFLOPS on a single dataflow device.
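As a rough illustration of solving a linear system with a matrix-free operator, the sketch below runs a textbook conjugate gradient in which the matrix appears only through a callback. It is not the paper's implementation. The operator `apply_A` is a hypothetical 1D stand-in: the negated neighbor-difference operator plus an assumed positive diagonal term `d`, chosen only so that the system is symmetric positive definite, as CG requires.

```python
import numpy as np

def conjugate_gradient(apply_A, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for SPD A, given only the action v -> A v."""
    b = np.asarray(b, dtype=float)
    x = np.zeros_like(b)
    r = b.copy()                      # residual for the zero initial guess
    p = r.copy()
    rs_old = float(np.dot(r, r))      # dot products are global reductions
    b_norm = float(np.sqrt(np.dot(b, b)))
    if b_norm == 0.0:
        return x, 0
    for k in range(max_iter):
        if np.sqrt(rs_old) <= tol * b_norm:
            return x, k
        Ap = apply_A(p)               # the only place the operator is used
        alpha = rs_old / float(np.dot(p, Ap))
        x += alpha * p
        r -= alpha * Ap
        rs_new = float(np.dot(r, r))
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x, max_iter

def apply_A(x, t=1.0, d=0.1):
    """Hypothetical SPD operator on a 1D chain of cells:
    (A x)_K = d * x_K - sum_L t * (x_L - x_K). Both t and d are made up."""
    y = d * x
    f = t * (x[1:] - x[:-1])
    y[:-1] -= f
    y[1:] += f
    return y

n = 20
b = np.ones(n)
x_sol, iters = conjugate_gradient(apply_A, b)
```

Each iteration needs one operator application and two dot products. On a distributed machine those dot products are global reductions, which is the kind of step an all-reduce serves.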
