Regent based parallel meshfree LSKUM solver for heterogenous HPC platforms

Sanath Salil, Nischay Ram Mamidi, Anil Nemili, Elliott Slaughter

TL;DR

This work develops a Regent-based meshfree LSKUM solver for $2D$ inviscid Euler equations on heterogeneous HPC platforms and benchmarks it against CUDA-C and Fortran+MPI implementations. It introduces a GPU-accelerated Regent path with implicit data movement and a CPU-parallel path based on METIS and Legion, verified on the NACA 0012 test cases. Results show Regent achieves portable performance and competitive CPU scaling, but GPU speedups lag CUDA-C due to lower SM utilisation and higher warp stalls; kernel-level optimizations such as splitting the flux_residual kernel help narrow the gap. The findings support Regent as a viable, maintainable route for cross-architecture CFD codes, with future work targeting $3D$ extensions and broader hardware support.

Abstract

Regent is an implicitly parallel programming language that allows the development of a single codebase for heterogeneous platforms targeting CPUs and GPUs. This paper presents the development of a parallel meshfree solver in Regent for two-dimensional inviscid compressible flows. The meshfree solver is based on the least squares kinetic upwind method. Example codes are presented to show the difference between the Regent and CUDA-C implementations of the meshfree solver on a GPU node. For CPU parallel computations, details are presented on how the data communication and synchronisation are handled by Regent and Fortran+MPI codes. The Regent solver is verified by applying it to the standard test cases for inviscid flows. Benchmark simulations are performed on coarse to very fine point distributions to assess the solver's performance. The computational efficiency of the Regent solver on an A100 GPU is compared with an equivalent meshfree solver written in CUDA-C. The codes are then profiled to investigate the differences in their performance. The performance of the Regent solver on CPU cores is compared with an equivalent explicitly parallel Fortran meshfree solver based on MPI. Scalability results are shown to offer insights into performance.

Paper Structure

This paper contains 12 sections, 16 equations, 8 figures, 11 tables, 1 algorithm.

Figures (8)

  • Figure 1: Subsonic flow around the NACA 0012 airfoil at $M_{\infty} = 0.63$ and $AoA = 2^\circ$.
  • Figure 2: Supersonic flow around the NACA 0012 airfoil at $M_{\infty} = 1.2$ and $AoA = 0^\circ$.
  • Figure 3: GPU memory used by the Regent and CUDA-C codes.
  • Figure 4: Speedup achieved by the Regent and CUDA-C GPU codes.
  • Figure 6: Performance of the Regent code on a CPU node compared with the Fortran parallel code.
  • ...and 3 more figures