Table of Contents
Fetching ...

Hardware Acceleration for HPS Algorithms in Two and Three Dimensions

Owen Melia, Daniel Fortunato, Jeremy Hoskins, Rebecca Willett

TL;DR

The work develops a GPU-accelerated framework for the hierarchical Poincaré–Steklov ($HPS$) direct solvers for variable-coefficient elliptic PDEs, addressing both two- and three-dimensional problems. It introduces a 2D leaf-recomputation strategy to minimize data transfers and a 3D adaptive discretization to dramatically reduce peak memory, all within a high-order spectral collocation and nested-dissection structure. An open-source JAX implementation enables seamless automatic differentiation, enabling forward and inverse problems such as high-frequency wave scattering and linearized Poisson–Boltzmann simulations. The results show substantial GPU speedups and memory savings, demonstrating the practicality of high-order, GPU-accelerated $HPS$ solvers for large-scale 2D/3D applications with potential impact on optimization, inverse problems, and scientific computing workflows.

Abstract

We provide a flexible, open-source framework for hardware acceleration, namely massively-parallel execution on general-purpose graphics processing units (GPUs), applied to the hierarchical Poincaré--Steklov (HPS) family of algorithms for building fast direct solvers for linear elliptic partial differential equations. To take full advantage of the power of hardware acceleration, we propose two variants of HPS algorithms to improve performance on two- and three-dimensional problems. In the two-dimensional setting, we introduce a novel recomputation strategy that minimizes costly data transfers to and from the GPU; in three dimensions, we modify and extend the adaptive discretization technique of Geldermans and Gillman [2019] to greatly reduce peak memory usage. We provide an open-source implementation of these methods written in JAX, a high-level accelerated linear algebra package, which allows for the first integration of a high-order fast direct solver with automatic differentiation tools. We conclude with extensive numerical examples showing our methods are fast and accurate on two- and three-dimensional problems.

Hardware Acceleration for HPS Algorithms in Two and Three Dimensions

TL;DR

The work develops a GPU-accelerated framework for the hierarchical Poincaré–Steklov () direct solvers for variable-coefficient elliptic PDEs, addressing both two- and three-dimensional problems. It introduces a 2D leaf-recomputation strategy to minimize data transfers and a 3D adaptive discretization to dramatically reduce peak memory, all within a high-order spectral collocation and nested-dissection structure. An open-source JAX implementation enables seamless automatic differentiation, enabling forward and inverse problems such as high-frequency wave scattering and linearized Poisson–Boltzmann simulations. The results show substantial GPU speedups and memory savings, demonstrating the practicality of high-order, GPU-accelerated solvers for large-scale 2D/3D applications with potential impact on optimization, inverse problems, and scientific computing workflows.

Abstract

We provide a flexible, open-source framework for hardware acceleration, namely massively-parallel execution on general-purpose graphics processing units (GPUs), applied to the hierarchical Poincaré--Steklov (HPS) family of algorithms for building fast direct solvers for linear elliptic partial differential equations. To take full advantage of the power of hardware acceleration, we propose two variants of HPS algorithms to improve performance on two- and three-dimensional problems. In the two-dimensional setting, we introduce a novel recomputation strategy that minimizes costly data transfers to and from the GPU; in three dimensions, we modify and extend the adaptive discretization technique of Geldermans and Gillman [2019] to greatly reduce peak memory usage. We provide an open-source implementation of these methods written in JAX, a high-level accelerated linear algebra package, which allows for the first integration of a high-order fast direct solver with automatic differentiation tools. We conclude with extensive numerical examples showing our methods are fast and accurate on two- and three-dimensional problems.

Paper Structure

This paper contains 36 sections, 2 theorems, 50 equations, 13 figures, 3 tables, 7 algorithms.

Key Result

Theorem D.1

Let $u_\theta$ solve eq:forward_scattering_problem with scattering potential $q=q_\theta$. Let $w$ solve Then

Figures (13)

  • Figure 1: Visualizing the high-order composite spectral collocation scheme for a simple two-dimensional problem. \ref{['fig:chebyshev_points']} shows the Chebyshev points on a two-dimensional problem with polynomial order $p=8$ and $L=1$ level of refinement. \ref{['fig:quad_merge_int_ext_points']} shows the Gauss--Lobatto points discretizing the boundaries of the leaves using order $q=6$. When merging nodes together, it is important to distinguish between the exterior boundary points, drawn with blue dots, and the interior boundary points, drawn with red $\times$'s.
  • Figure 2: Comparing the batching patterns of the two different recomputation strategies. The leaf recomputation strategy in \ref{['fig:naive_recomputation']} performs the local solve stage operations in large batches and then performs all of the merge stages in a separate batch. Our proposed subtree recomputation strategy (\ref{['fig:our_recomputation']}) performs local solves and multiple levels of the merge stage for a complete subtree of the discretization tree structure.
  • Figure 3: Even with a naïve implementation of the HPS algorithm which does not perform any recomputation, using a single GPU achieves large speedups over a multicore CPU system. When we use our proposed subtree recomputation strategy, the speedup increases by another factor of two. (Left) We vary $L=4, \dots, 9$ and hold $p=16$ fixed to generate problems with $N=p^2 4^L = 256 \times 4^L = 4^{L+4}$ degrees of freedom. We measure the total runtime of our 2D method merging DtN matrices; this runtime includes the execution of the local solve stage, the merge stage, and the downward pass. Vertical error bars show $\pm$ 1 standard error computed over five trials. (Right) For each of the GPU implementations, we compute the total number of FLOPS and report this as a percentage of the GPU's peak FLOPS, estimated by the manufacturer to be $34 \times 10^{12}$.
  • Figure 4: Using $p$ Chebyshev points per dimension on each leaf, and leaves of size $h$, the relative $\ell_\infty$ errors of our method converge at rate $O(h^{p-2})$. \ref{['fig:hp_convergence_DtN']} shows the convergence of the HPS method using DtN matrices applied to \ref{['eq:DtN_problem']} and \ref{['fig:hp_convergence_ItI']} shows the convergence of the HPS method using ItI matrices applied to \ref{['eq:ItI_problem']}.
  • Figure 5: Visualizing the solutions of forward scattering problems for the single Gaussian bump scattering potential. \ref{['fig:k100_utot_GBM_1', 'fig:k200_utot_GBM_1']} show the real part of the total wave field.
  • ...and 8 more figures

Theorems & Definitions (2)

  • Theorem D.1: Theorem 3.1 of borges_high_2016
  • Theorem D.2: Theorem 3.2 of borges_high_2016