GPU Optimizations for the Hierarchical Poincaré-Steklov Scheme

Anna Yesypenko; Per-Gunnar Martinsson

GPU Optimizations for the Hierarchical Poincaré-Steklov Scheme

Anna Yesypenko, Per-Gunnar Martinsson

TL;DR

The paper tackles efficient 2D Helmholtz solves using the Hierarchical Poincaré-Steklov (HPS) discretization, which couples high-order local discretizations with sparse direct solvers. It introduces GPU-accelerated leaf computations via batched BLAS and compares two sparse-solver paths for the reduced interface system: a black-box solver with nested dissection and SlabLU, a two-level slab-based method with batched parallelism and $\mathcal{H}$-matrix-like structure. The key contributions are demonstrating substantial GPU speedups for the costly leaf condensation (scaling as $O(p^4 N)$) up to $p=42$, and showing that SlabLU offers favorable build-time and memory footprints while maintaining high accuracy, enabling 2D problems up to $N \approx 8.1\times 10^{7}$ unknowns. The results indicate that high-order HPS can resolve high-frequency scattering on curved and rectangular domains with significant practical impact and potential extension to 3D, albeit with increased complexity in $p$-scaling.

Abstract

This manuscript presents GPU optimizations for the 2D Hierarchical Poincaré-Steklov (HPS) discretization scheme. HPS is a multi-domain spectral collocation method that combines high-order discretizations with direct solvers to accurately resolve highly oscillatory solutions. The domain decomposition approach of HPS connects domains directly via a sparse direct solver. The proposed optimizations exploit batched linear algebra on modern hybrid architectures, are straightforward to implement, and improve the solver's practical speed. The manuscript demonstrates that GPU optimizations can significantly reduce the traditionally high cost of performing local static condensation for discretizations with very high local order $p$. Numerical experiments for the Helmholtz equation with high wavenumbers on curved and rectangular domains confirm the high accuracy achieved by the HPS discretization and the significant reduction in computation time achieved with GPU optimizations.

GPU Optimizations for the Hierarchical Poincaré-Steklov Scheme

TL;DR

-matrix-like structure. The key contributions are demonstrating substantial GPU speedups for the costly leaf condensation (scaling as

) up to

, and showing that SlabLU offers favorable build-time and memory footprints while maintaining high accuracy, enabling 2D problems up to

unknowns. The results indicate that high-order HPS can resolve high-frequency scattering on curved and rectangular domains with significant practical impact and potential extension to 3D, albeit with increased complexity in

-scaling.

Abstract

. Numerical experiments for the Helmholtz equation with high wavenumbers on curved and rectangular domains confirm the high accuracy achieved by the HPS discretization and the significant reduction in computation time achieved with GPU optimizations.

Paper Structure (6 sections, 11 equations, 10 figures)

This paper contains 6 sections, 11 equations, 10 figures.

Introduction
HPS Discretization and Interfacing with Sparse Direct Solvers
Numerical Experiments
Comparison of Sparse Direct Solvers
Convergence for Scattering Problems for various $p$
Conclusions

Figures (10)

Figure 1: Prior to interfacing with sparse direct solvers, we do static condensation to produce an equivalent system (\ref{['eq:hps_reduced']}) to solve on the remaining active nodes. The original grid has $N$ points, and remaining grid has $\approx N/p$ points.
Figure 2: Domain decomposition used in SlabLU. The even-numbered nodes correspond to the nodes interior to each subdomain. The odd-numbered nodes correspond to interfaces between slabs. The slab partitioning is chosen so that interactions between slab interiors are zero. The slabs have width of $b$ points.
Figure 3: Leaf operations for HPS require $O(p^4 N)$ operations, though the practical scaling for parallel operations has a small constant prefactor for $p$ up to 42. Parallel HPS leaf operations are further accelerated on GPUs, with a speed-up of at least 4x.
Figure 4: Build time and memory footprint comparison between SuperLU and SlabLU for (\ref{['eq:constant_coeff_helm']}) discretized with HPS ($p$=22), where $\kappa$ is increased to maintain 10 points per wavelength. For $N$=5.06M, the SlabLU build time is faster by a factor of 16x and the memory footprint is less by a factor of 14x.
Figure 6: Build time and solve time for HPS with various $p$ for (\ref{['eq:constant_coeff_helm']}) where $\kappa$ is increased with $N$ to maintain 10 points per wavelength. The choice of $p$ does not substantially affect the time needed to factorize the sparse linear system with SlabLU. As $p$ increases, the memory footprint required to store the factorization decreases and the solve time increases.
...and 5 more figures

Theorems & Definitions (1)

remark 1

GPU Optimizations for the Hierarchical Poincaré-Steklov Scheme

TL;DR

Abstract

GPU Optimizations for the Hierarchical Poincaré-Steklov Scheme

Authors

TL;DR

Abstract

Table of Contents

Figures (10)

Theorems & Definitions (1)