GPU Optimizations for the Hierarchical Poincaré-Steklov Scheme
Anna Yesypenko, Per-Gunnar Martinsson
TL;DR
The paper tackles efficient 2D Helmholtz solves using the Hierarchical Poincaré-Steklov (HPS) discretization, which couples high-order local discretizations with sparse direct solvers. It introduces GPU-accelerated leaf computations via batched BLAS and compares two sparse-solver paths for the reduced interface system: a black-box solver with nested dissection and SlabLU, a two-level slab-based method with batched parallelism and $\mathcal{H}$-matrix-like structure. The key contributions are demonstrating substantial GPU speedups for the costly leaf condensation (scaling as $O(p^4 N)$) up to $p=42$, and showing that SlabLU offers favorable build-time and memory footprints while maintaining high accuracy, enabling 2D problems up to $N \approx 8.1\times 10^{7}$ unknowns. The results indicate that high-order HPS can resolve high-frequency scattering on curved and rectangular domains with significant practical impact and potential extension to 3D, albeit with increased complexity in $p$-scaling.
Abstract
This manuscript presents GPU optimizations for the 2D Hierarchical Poincaré-Steklov (HPS) discretization scheme. HPS is a multi-domain spectral collocation method that combines high-order discretizations with direct solvers to accurately resolve highly oscillatory solutions. The domain decomposition approach of HPS connects domains directly via a sparse direct solver. The proposed optimizations exploit batched linear algebra on modern hybrid architectures, are straightforward to implement, and improve the solver's practical speed. The manuscript demonstrates that GPU optimizations can significantly reduce the traditionally high cost of performing local static condensation for discretizations with very high local order $p$. Numerical experiments for the Helmholtz equation with high wavenumbers on curved and rectangular domains confirm the high accuracy achieved by the HPS discretization and the significant reduction in computation time achieved with GPU optimizations.
