Investigating Matrix Repartitioning to Address the Over- and Undersubscription Challenge for a GPU-based CFD Solver

Gregor Olenik, Marcel Koch, Hartwig Anzt

TL;DR

The paper tackles the oversubscription problem that arises when integrating GPU solvers into OpenFOAM via plugin-based approaches. It introduces a repartitioning strategy that maps CPU-based matrix assembly to GPU-based solves using a controlled ratio $n_{GPU}=n_{CPU}/\alpha$. To enable efficient updates, it constructs three key data structures (a sparsity pattern, an update pattern $U$, and a permutation matrix $P$) along with a dedicated MPI communicator for the GPU-owning ranks. The method is implemented in the OpenFOAM-Ginkgo Layer and evaluated on a lid-driven cavity benchmark, where repartitioning yields substantial performance gains (speedups of up to around $10\times$) and mitigates GPU oversubscription, with additional improvements when using GPU-aware MPI. The approach provides a practical, less invasive path to efficient heterogeneous HPC in OpenFOAM workflows and motivates future work on industrial-scale cases and extensions to distributed multigrid and preconditioning strategies.
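To make the ratio $n_{GPU}=n_{CPU}/\alpha$ concrete, the following is a minimal Python sketch, not the OpenFOAM-Ginkgo Layer implementation. It assumes that contiguous blocks of $\alpha$ CPU ranks are assigned to one GPU-owning rank and that the permutation $P$ can be represented as a plain index list; the summary does not specify either detail. The names `repartition` and `apply_permutation` are hypothetical.

```python
from typing import Dict, List


def repartition(n_cpu: int, alpha: int) -> Dict[int, List[int]]:
    """Map each GPU-owning rank to the CPU ranks whose matrix parts it receives.

    Assumption (not taken from the paper): contiguous blocks of `alpha`
    CPU ranks are fused, and `alpha` divides `n_cpu` evenly.
    """
    if alpha < 1 or n_cpu % alpha != 0:
        raise ValueError("this sketch assumes alpha >= 1 divides n_cpu evenly")
    groups: Dict[int, List[int]] = {}
    for cpu_rank in range(n_cpu):
        owner = cpu_rank // alpha  # index of the GPU-owning rank
        groups.setdefault(owner, []).append(cpu_rank)
    return groups


def apply_permutation(values: List[float], perm: List[int]) -> List[float]:
    """Reorder gathered coefficients so that result[i] = values[perm[i]].

    Stands in for applying a permutation matrix P to the concatenated
    coefficient values received from the fused CPU ranks.
    """
    if sorted(perm) != list(range(len(values))):
        raise ValueError("perm must be a permutation of 0..len(values)-1")
    return [values[j] for j in perm]


# 8 CPU ranks with alpha = 4 give n_GPU = 8 / 4 = 2 GPU-owning ranks.
groups = repartition(n_cpu=8, alpha=4)
reordered = apply_permutation([10.0, 20.0, 30.0], [2, 0, 1])
```

In this sketch, a larger $\alpha$ means fewer ranks share each GPU, which is the lever the paper uses against oversubscription. How the sparsity pattern and the update pattern $U$ are actually built is not shown here.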

Abstract

Modern high-performance computing (HPC) increasingly relies on GPUs, but integrating GPU acceleration into complex scientific frameworks like OpenFOAM remains a challenge. Existing approaches either fully refactor the codebase or use plugin-based GPU solvers, each facing trade-offs between performance and development effort. In this work, we address the limitations of plugin-based GPU acceleration in OpenFOAM by proposing a repartitioning strategy that better balances CPU matrix assembly and GPU-based linear solves. We present a detailed computational model, describe a novel matrix repartitioning and update procedure, and evaluate its performance on large-scale CFD simulations. Our results show that the proposed method significantly mitigates oversubscription issues, improving solver performance and resource utilization in heterogeneous CPU-GPU environments.

Paper Structure

This paper contains 5 sections, 3 equations, 9 figures.

Figures (9)

  • Figure 1: Flow chart of the principal steps within a timestep in the icoFOAM solver.
  • Figure 2: Structure of the distributed matrix in LDU format on the host (top) and after repartitioning on the accelerator (bottom), with $\alpha = 2$.
  • Figure 3: Illustration of the repartitioning procedure.
  • Figure 4: Impact of the repartitioning ratio RPG on the linear solver performance in Tflop/s for different numbers of compute nodes and problem sizes.
  • Figure 5: Impact of the repartitioning ratio RPG on the time spent on host-side computations for different problem sizes and number of compute nodes.
  • ...and 4 more figures