Massive parallelization and performance enhancement of an immersed boundary method based unsteady flow solver

Rahul Sundar, Dipanjan Majumdar, Chhote Lal Shah, Sunetra Sarkar

TL;DR

The paper tackles the high cost of simulating unsteady flows with immersed boundary methods by porting an in-house IBM solver to GPUs using OpenACC. It follows an incremental porting workflow, evaluates performance against serial and OpenMP CPU baselines, and reports speedups of up to $O(10)$ over the OpenMP version and up to $O(10^2)$ over the serial code, with favorable scaling as mesh size increases. The work demonstrates that careful hotspot analysis, loop-level parallelization, and minimized CPU-GPU data transfers are key to realizing GPU performance gains in IBM-based CFD. The findings matter in practice for rapid design space exploration in complex fluid-structure interaction problems, and they establish a pathway for extending the approach to other in-house solvers.

Abstract

High-fidelity simulations of unsteady fluid flow are now possible with advancements in high-performance computing hardware and software frameworks. Since computational fluid dynamics (CFD) computations are dominated by linear algebra routines, they can be significantly accelerated through massive parallelization on graphics processing units (GPUs). Thus, GPU implementation of high-fidelity CFD solvers is essential in reducing the turnaround time for quicker design space exploration. In the present work, an immersed boundary method (IBM) based in-house flow solver has been ported to the GPU using OpenACC, a compiler directive-based heterogeneous parallel programming framework. Out of the various GPU porting pathways available, OpenACC was chosen because of its minimal code intrusion, low development time, and close similarity to OpenMP, a directive-based shared-memory programming framework. A detailed validation study and performance analysis of the parallel solver implementations on the CPU and GPU are presented. The GPU implementation shows a speedup of up to $O(10)$ over the CPU parallel version and up to $O(10^2)$ over the serial code. The GPU implementation also scales well with increasing mesh size owing to the efficient utilization of the GPU processor cores.
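
To make the directive-based approach concrete, the following is a minimal sketch of how a loop-level hotspot might be offloaded with OpenACC while keeping the arrays resident on the GPU between sweeps. It is not the authors' solver: the Jacobi iteration for Laplace's equation, the function name `jacobi_solve`, the grid size `NX` x `NY`, and the `IDX` macro are all assumptions chosen for illustration. Built without OpenACC support, the directives are ignored and the code runs serially.

```c
#include <assert.h>
#include <stdlib.h>

#define NX 33                        /* illustrative grid size (assumption) */
#define NY 33
#define IDX(i, j) ((j) * NX + (i))   /* row-major index into a flat array */

/* Hypothetical hotspot: Jacobi sweeps for Laplace's equation with fixed
 * boundary values stored in u. Returns the number of sweeps performed,
 * or -1 if the scratch array cannot be allocated. */
int jacobi_solve(double *u, int max_iters, double tol)
{
    assert(u != NULL && max_iters > 0);

    double *unew = malloc(sizeof(double) * NX * NY);
    if (unew == NULL) return -1;
    for (int k = 0; k < NX * NY; ++k) unew[k] = u[k];

    int iters = 0;
    double max_diff = tol + 1.0;

    /* One data region around the whole iteration: the arrays are moved to
     * the device once and u is copied back once at the end, rather than
     * on every sweep. */
    #pragma acc data copy(u[0:NX*NY]) copyin(unew[0:NX*NY])
    {
        while (iters < max_iters && max_diff > tol) {
            max_diff = 0.0;

            /* The OpenMP counterpart of this directive would be
             * #pragma omp parallel for collapse(2) reduction(max:max_diff) */
            #pragma acc parallel loop collapse(2) reduction(max:max_diff)
            for (int j = 1; j < NY - 1; ++j) {
                for (int i = 1; i < NX - 1; ++i) {
                    double v = 0.25 * (u[IDX(i - 1, j)] + u[IDX(i + 1, j)] +
                                       u[IDX(i, j - 1)] + u[IDX(i, j + 1)]);
                    double d = v - u[IDX(i, j)];
                    if (d < 0.0) d = -d;
                    if (d > max_diff) max_diff = d;
                    unew[IDX(i, j)] = v;
                }
            }

            /* Copy the updated interior back into u, still on the device. */
            #pragma acc parallel loop collapse(2)
            for (int j = 1; j < NY - 1; ++j) {
                for (int i = 1; i < NX - 1; ++i) {
                    u[IDX(i, j)] = unew[IDX(i, j)];
                }
            }
            ++iters;
        }
    }

    free(unew);
    return iters;
}
```

A caller would allocate an `NX * NY` array, set the boundary values, and call `jacobi_solve(u, max_iters, tol)`. The only changes relative to a serial version are the three `#pragma acc` lines, which is the sense in which a directive-based port is minimally intrusive.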

Paper Structure

This paper contains 20 sections, 6 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Computational domain
  • Figure 2: Schematics representing the minimal code changes when using OpenMP and OpenACC
  • Figure 3: Nsight-systems profiler output for the NVTX annotated serial version of the IBM solver for a single time marching step.
  • Figure 4: Final Nsight-systems profiler outputs for the NVTX annotated OpenACC version of the IBM solver for a single time marching step.
  • Figure 5: Plots comparing the (a) lift and (b) drag coefficient time histories for the serial, OpenMP, and OpenACC implementations with the results of Khalid et al. [khalid2018bifurcations] for a sinusoidally plunging rigid elliptic foil.
  • ...and 1 more figure