Architecture-aware $h$-to-$p$ optimisation: spectral/$hp$ element operators for mixed-element meshes

Jacques Y. Xing, Boyang Xia, Diego Renner, Chris D. Cantwell, David Moxey, Robert M. Kirby, Spencer J. Sherwin

Abstract

We extend earlier international efforts to optimise hexahedral-based spectral element methods on GPUs and vectorised CPUs to mixed-element meshes that additionally contain prismatic, pyramidic, and tetrahedral elements, using tensorial expansions. We demonstrate that common finite element operators (such as the mass and Helmholtz matrices) achieve optimal performance with different implementation strategies depending on the element shape, polynomial order, and system architecture. In addition, we introduce a new approach, or interpretation, for efficiently evaluating more complex operations whose integrand involves inner products with derivatives of the expansions, such as the stiffness matrix. This approach seeks to maximise the number of operations that exploit the collocation properties of the nodal tensorial expansion associated with classical quadrature rules. Our GPU performance tests demonstrate that the throughput of the Helmholtz operator on tetrahedral elements is at most 2.5 times lower than on hexahedral elements, even though tetrahedra require six times as many floating-point operations.
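To make the idea of "alternative implementation strategies" concrete, the following is a minimal sketch contrasting a standard-matrix evaluation with sum-factorisation for a tensor-product basis on a single hexahedral element. It is a generic textbook illustration, not the paper's implementation: the function names, the flat coefficient layout, and the use of one shared 1D basis matrix `B1d` (Q quadrature points by P modes) in all three directions are assumptions made for this example.

```python
def std_mat_apply(B1d, u, P, Q):
    """Standard-matrix style: apply the full 3D operator in one pass.

    Costs P^3 * Q^3 multiply-adds. u is a flat list of P^3 coefficients
    indexed as (p * P + q) * P + r (layout assumed for this sketch).
    """
    out = [0.0] * (Q ** 3)
    for i in range(Q):
        for j in range(Q):
            for k in range(Q):
                acc = 0.0
                for p in range(P):
                    for q in range(P):
                        for r in range(P):
                            acc += (B1d[i][p] * B1d[j][q] * B1d[k][r]
                                    * u[(p * P + q) * P + r])
                out[(i * Q + j) * Q + k] = acc
    return out


def sum_fac_apply(B1d, u, P, Q):
    """Sum-factorisation: contract one tensor direction at a time.

    Costs P^3*Q + P^2*Q^2 + P*Q^3 multiply-adds for the same result.
    """
    # Direction 3: t1[p][q][k] = sum_r B1d[k][r] * u[p, q, r]
    t1 = [[[sum(B1d[k][r] * u[(p * P + q) * P + r] for r in range(P))
            for k in range(Q)] for q in range(P)] for p in range(P)]
    # Direction 2: t2[p][j][k] = sum_q B1d[j][q] * t1[p][q][k]
    t2 = [[[sum(B1d[j][q] * t1[p][q][k] for q in range(P))
            for k in range(Q)] for j in range(Q)] for p in range(P)]
    # Direction 1: out[i, j, k] = sum_p B1d[i][p] * t2[p][j][k]
    return [sum(B1d[i][p] * t2[p][j][k] for p in range(P))
            for i in range(Q) for j in range(Q) for k in range(Q)]


def std_mat_madds(P, Q):
    """Multiply-add count of the one-pass evaluation above."""
    return P ** 3 * Q ** 3


def sum_fac_madds(P, Q):
    """Multiply-add count of the three staged contractions above."""
    return P ** 3 * Q + P ** 2 * Q ** 2 + P * Q ** 3
```

Both functions return the same values at the Q^3 quadrature points; only the operation count and memory access pattern differ. Which variant runs faster in practice depends on the polynomial order, element shape, and hardware, which is the trade-off the abstract describes.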

Paper Structure

This paper contains 28 sections, 27 equations, 14 figures, 6 algorithms.

Figures (14)

  • Figure 1: Representation of a high-order triangular element with three different coordinate systems. Elements may be curvilinear or benefit from reduced geometric information if regular straight-sided. The distribution of quadrature points is illustrated with equispaced lines.
  • Figure 2: Nektar++ execution space model hierarchy. Execution spaces are shown in green, while specific backends are shown in blue. Availability of backends depends on the host and/or device architecture.
  • Figure 3: Diagram showing the SumFac implementation on GPU. The four grey dots are stored contiguously in memory and accessed by different threads. For illustrative purposes, we suppose there are four threads in each work group. Each work group processes a different element group.
  • Figure 4: Diagram showing the SumFacTOP implementation on GPU. The four grey dots are stored contiguously in memory and accessed by different threads. For illustrative purposes, we suppose there are four threads in each work group. Each work group processes a different element.
  • Figure 5: (a) NVIDIA GH200 Grace Hopper Superchip and (b) Intel Xeon 6526Y throughput performance versus elemental degrees of freedom for the mass operator on hexahedral (Hex), prismatic (Prism), pyramidic (Pyr) and tetrahedral (Tet) elements. Results are shown for different polynomial degrees (P): the standard matrix (StdMat) approach in red, the vectorised sum-factorisation (SumFac) in blue, and the sum-factorisation threaded on output point (SumFacTOP) in green.
  • ...and 9 more figures