Mixed-precision finite element kernels and assembly: Rounding error analysis and hardware acceleration

M. Croci, G. N. Wells

TL;DR

This paper develops the first fine-grained rounding error analysis of finite element (FE) cell kernels and assembly and introduces hardware-accelerated mixed-precision implementation strategies which are provably robust to low-precision computations.

Abstract

In this paper we develop the first fine-grained rounding error analysis of finite element (FE) cell kernels and assembly. The theory includes mixed-precision implementations and accounts for hardware acceleration via matrix multiplication units, thus providing theoretical guidance for designing reduced- and mixed-precision FE algorithms on CPUs and GPUs. Guided by this analysis, we introduce hardware-accelerated mixed-precision implementation strategies which are provably robust to low-precision computations. Indeed, these algorithms are accurate to the lower-precision unit roundoff with an error constant that is independent of the conditioning of FE basis function evaluations, the ill-posedness of the cell, the polynomial degree, and the number of quadrature nodes. Consequently, we present the first AMX-accelerated FE kernel implementations on Intel Sapphire Rapids CPUs. Numerical experiments demonstrate that the proposed mixed- (single/half-) precision algorithms are up to 60 times faster than their double-precision equivalents while being orders of magnitude more accurate than their fully half-precision counterparts.

Paper Structure

This paper contains 44 sections, 22 theorems, 128 equations, 7 figures, 1 table.

Key Result

Lemma 2.1

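If $|\delta_i|\leq u$ and $\rho_i=\pm 1$ for $i=1,\dots,n$ and $nu<1$, then $$\prod_{i=1}^{n}(1+\delta_i)^{\rho_i} = 1+\theta_n, \qquad |\theta_n| \leq \frac{nu}{1-nu} =: \gamma_n.$$ If $u$ is sufficiently small, then there exists $c \geq 1$ such that $\gamma_n\leq cu$.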
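The bound in Lemma 2.1 can be checked numerically. The sketch below assumes the standard definition $\gamma_n = nu/(1-nu)$ from Lemma 3.1 in higham2002accuracy and uses exact rational arithmetic so that the check itself introduces no rounding. The helper names `gamma` and `theta` and the choice of $u = 2^{-11}$ (the fp16 unit roundoff) are illustrative assumptions, not taken from the paper.

```python
from fractions import Fraction
import random


def gamma(n, u):
    """Return gamma_n = n*u / (1 - n*u); only defined for n*u < 1."""
    if n * u >= 1:
        raise ValueError("gamma_n requires n*u < 1")
    return n * u / (1 - n * u)


def theta(deltas, rhos):
    """Return theta_n such that prod_i (1 + delta_i)**rho_i = 1 + theta_n."""
    prod = Fraction(1)
    for d, r in zip(deltas, rhos):
        prod *= (1 + d) ** r  # rho_i = -1 gives the reciprocal
    return prod - 1


u = Fraction(1, 2**11)  # unit roundoff of IEEE fp16 (assumed for illustration)
n = 50
rng = random.Random(0)

# Random perturbations with |delta_i| <= u and random signs rho_i = +/-1.
deltas = [Fraction(rng.randint(-1000, 1000), 1000) * u for _ in range(n)]
rhos = [rng.choice((-1, 1)) for _ in range(n)]

assert abs(theta(deltas, rhos)) <= gamma(n, u)

# For n*u <= 1/2 the denominator is at least 1/2, so gamma_n <= 2*n*u.
assert gamma(n, u) <= 2 * n * u
```

The second assertion illustrates the final statement of the lemma: for fixed $n$ and small enough $u$, $\gamma_n$ is bounded by a constant multiple of $u$.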

Figures (7)

  • Figure 1: Pie charts of the computational cost distribution of different Poisson kernel subroutines on hexahedra. The results, entirely run in double precision, are shown for different polynomial degrees $p$. The top and bottom rows correspond to the Poisson bilinear form and action respectively. By "mat-mats" we indicate the cost of accumulating the matrix-matrix products into $A$ and $\bm{v}$ and by "FE feval" we denote the evaluation of $\nabla \check{w}({\check{\bm{x}}})$.
  • Figure 2: Relative rounding errors in the fp32 evaluation of the Poisson geometry as a function of the condition number of the reference map Jacobian. Fitting a line to the errors in the log-log plot yields the linear proportionality expected from Theorem \ref{th:poisson_geometry}.
  • Figure 3: Relative rounding errors in mass kernel evaluations on tetrahedra versus the number of quadrature nodes $n_q$. Here the fp16 format is used in place of the precision $u_q$ and single precision is used in everything else.
  • Figure 4: Relative rounding errors arising in the mass kernel evaluations via fp16 and fp32/bf16 mixed-precision plotted against the polynomial degree. The AVX512-bf16 and AMX-bf16 curves almost overlap. Note that the fp16 errors grow with $p$ while the mixed-precision errors are constant.
  • Figure 5: Relative rounding errors arising in the Poisson kernel evaluations via fp16 and fp32/bf16 mixed-precision plotted against the polynomial degree. The AVX512-bf16 and AMX-bf16 curves almost overlap for bilinear forms (top figures). The slight difference in the error constant of AVX512-bf16 and AMX-bf16 computations in the action kernels (bottom figures) is due to the different implementations: the AVX512-bf16 kernels are slightly more accurate since they evaluate FE functions using only single precision (cf. Appendix \ref{appendix_sec:implementation_details}).
  • ...and 2 more figures
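
The contrast described in the captions of Figures 3 and 4 (low-precision accumulation errors that grow with the problem size versus mixed-precision errors that stay small) can be illustrated on a plain dot product. This is a simplified sketch, not the paper's kernels: it stores inputs in fp16 rather than bf16, emulates rounding in software with the standard `struct` module, and measures error relative to the already-rounded inputs. The function names are hypothetical.

```python
import math
import random
import struct


def fl16(x):
    """Round a Python float to the nearest IEEE fp16 value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def fl32(x):
    """Round a Python float to the nearest IEEE fp32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def dot(a, b, fl):
    """Dot product by recursive summation, rounding every operation with fl."""
    s = 0.0
    for x, y in zip(a, b):
        s = fl(s + fl(x * y))
    return s


def relative_errors(n, seed=0):
    """Return (fully fp16 error, fp16-storage/fp32-accumulation error)."""
    rng = random.Random(seed)
    # Inputs are stored in fp16 in both variants.
    a = [fl16(rng.uniform(0.0, 1.0)) for _ in range(n)]
    b = [fl16(rng.uniform(0.0, 1.0)) for _ in range(n)]
    # Products of fp16 values are exact in double precision, and fsum
    # returns a correctly rounded sum, so this is an accurate reference.
    exact = math.fsum(x * y for x, y in zip(a, b))
    err16 = abs(dot(a, b, fl16) - exact) / exact
    err_mixed = abs(dot(a, b, fl32) - exact) / exact
    return err16, err_mixed


for n in (10, 100, 1000):
    err16, err_mixed = relative_errors(n)
    print(f"n={n:5d}  fp16: {err16:.2e}  mixed: {err_mixed:.2e}")
```

Accumulating in fp32 keeps the accumulation error at the level of the fp32 unit roundoff, which is negligible next to the fp16 storage precision; this is the general mechanism that mixed-precision accumulation relies on, shown here under the stated simplifications.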

Theorems & Definitions (59)

  • Definition 2.1
  • Definition 2.2
  • Lemma 2.1: Lemma 3.1 in higham2002accuracy
  • Lemma 2.2: Lemma 3.3 in higham2002accuracy
  • Theorem 2.3: Theorem 9.3, Section 9, and Section 14.6 in higham2002accuracy
  • Theorem 2.4: Corollary 2.14 in ipsen2008perturbation
  • Corollary 2.5
  • proof
  • Remark 2.1
  • Example 2.1
  • ...and 49 more