A Parallel and Highly-Portable HPC Poisson Solver: Preconditioned Bi-CGSTAB with alpaka
Luca Pennati, Måns I. Andersson, Klaus Steiniger, Rene Widera, Tapish Narwal, Michael Bussmann, Stefano Markidis
TL;DR
This work tackles scalable Poisson solvers for heterogeneous HPC systems by delivering a parallel, matrix-free solver based on Preconditioned Bi-CGSTAB implemented with MPI and the alpaka portability layer. By exploring multiple preconditioners, including a communication-free Chebyshev-based approach, the authors demonstrate significant reductions in iteration counts and dramatic improvements in time-to-solution across CPUs and GPUs. The BiCGS-GNoComm(CI) variant achieves the best overall performance, with more than a 6.5× speedup over the unpreconditioned case and up to 50× over a fully communication-heavy variant, while maintaining robust convergence across AMD and NVIDIA GPUs and strong scalability up to 64 devices. These results underscore the practicality of performance portability for large-scale Poisson problems and highlight the potential of communication-avoiding strategies in heterogeneous HPC environments.
Abstract
This paper presents the design, implementation, and performance analysis of a parallel and GPU-accelerated Poisson solver based on the Preconditioned Bi-Conjugate Gradient Stabilized (Bi-CGSTAB) method. The implementation uses the MPI standard for distributed-memory parallelism, while on-node computation is handled by the alpaka framework, which provides both shared-memory parallelism and inherent performance portability across different hardware architectures. We evaluate the solver's performance on CPUs and GPUs (NVIDIA Hopper H100 and AMD MI250X), comparing different preconditioning strategies, including Block Jacobi and Chebyshev iteration, and analyzing performance at both the single-node and multi-node levels. Execution efficiency is characterized with a strong-scaling test and with the AMD Omnitrace profiling tool. Our results indicate that a communication-free preconditioner based on the Chebyshev iteration can speed up the solver by more than six times. The solver shows comparable performance across different GPU architectures, achieving a computational speed-up of up to 50 times compared to the CPU implementation. In addition, it shows a strong-scaling efficiency greater than 90% up to 64 devices.
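To make the method named in the abstract concrete, the following is a minimal serial sketch of right-preconditioned Bi-CGSTAB with a fixed-step Chebyshev iteration as preconditioner. It is an illustration under stated assumptions, not the authors' implementation: it uses plain C++ without MPI or alpaka, a matrix-free 1D Poisson stencil with zero Dirichlet boundaries, and hypothetical names (`applyA`, `chebyshev`, `bicgstab`) and eigenvalue bounds (`lmin`, `lmax`) chosen here for illustration only.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;

// Matrix-free 1D Poisson operator (scaled by h^2, zero Dirichlet boundaries).
// Assumption: a 1D stencil stands in for the solver's actual discretization.
Vec applyA(const Vec& x) {
    const std::size_t n = x.size();
    Vec y(n);
    for (std::size_t i = 0; i < n; ++i) {
        const double left = (i > 0) ? x[i - 1] : 0.0;
        const double right = (i + 1 < n) ? x[i + 1] : 0.0;
        y[i] = 2.0 * x[i] - left - right;
    }
    return y;
}

double dot(const Vec& a, const Vec& b) {
    assert(a.size() == b.size());
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

// y += a * x
void axpy(double a, const Vec& x, Vec& y) {
    assert(x.size() == y.size());
    for (std::size_t i = 0; i < x.size(); ++i) y[i] += a * x[i];
}

// Chebyshev preconditioner: z ~ A^{-1} r after a fixed number of steps,
// given spectrum bounds [lmin, lmax]. steps <= 0 means "no preconditioner".
Vec chebyshev(Vec r, double lmin, double lmax, int steps) {
    if (steps <= 0) return r;
    const double theta = 0.5 * (lmax + lmin);
    const double delta = 0.5 * (lmax - lmin);
    const double sigma = theta / delta;
    double rho = 1.0 / sigma;
    Vec z(r.size(), 0.0), d(r.size());
    for (std::size_t i = 0; i < r.size(); ++i) d[i] = r[i] / theta;
    for (int k = 0; k < steps; ++k) {
        axpy(1.0, d, z);           // z_{k+1} = z_k + d_k
        axpy(-1.0, applyA(d), r);  // r_{k+1} = r_k - A d_k
        const double rhoNew = 1.0 / (2.0 * sigma - rho);
        for (std::size_t i = 0; i < r.size(); ++i)
            d[i] = rhoNew * rho * d[i] + (2.0 * rhoNew / delta) * r[i];
        rho = rhoNew;
    }
    return z;
}

// Right-preconditioned Bi-CGSTAB. Returns the iteration count on convergence
// (relative residual <= tol), or -1 if maxIter is reached.
int bicgstab(const Vec& b, Vec& x, double lmin, double lmax,
             int chebSteps, double tol, int maxIter) {
    Vec r = b;
    axpy(-1.0, applyA(x), r);      // r = b - A x
    const Vec rhat = r;            // fixed shadow residual
    Vec p(b.size(), 0.0), v(b.size(), 0.0);
    double rho = 1.0, alpha = 1.0, omega = 1.0;
    const double stop = tol * std::sqrt(dot(b, b));
    if (std::sqrt(dot(r, r)) <= stop) return 0;
    for (int it = 1; it <= maxIter; ++it) {
        const double rhoNew = dot(rhat, r);
        const double beta = (rhoNew / rho) * (alpha / omega);
        for (std::size_t i = 0; i < p.size(); ++i)
            p[i] = r[i] + beta * (p[i] - omega * v[i]);
        const Vec y = chebyshev(p, lmin, lmax, chebSteps);
        v = applyA(y);
        alpha = rhoNew / dot(rhat, v);
        axpy(-alpha, v, r);        // r now holds s = r - alpha v
        axpy(alpha, y, x);
        if (std::sqrt(dot(r, r)) <= stop) return it;
        const Vec z = chebyshev(r, lmin, lmax, chebSteps);
        const Vec t = applyA(z);
        omega = dot(t, r) / dot(t, t);
        axpy(omega, z, x);
        axpy(-omega, t, r);        // r = s - omega t
        if (std::sqrt(dot(r, r)) <= stop) return it;
        rho = rhoNew;
    }
    return -1;
}
```

For a 32-point grid one might call `bicgstab(b, x, lmin, 4.0, 8, 1e-10, 200)` with `lmin = 2 - 2*cos(pi/33)`, the exact smallest eigenvalue of this particular stencil. In a distributed setting, the preconditioner becomes communication-free if `applyA` inside `chebyshev` is restricted to each rank's local subdomain without halo exchange; that restriction is not shown in this serial sketch.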
