Code Generation and Performance Engineering for Matrix-Free Finite Element Methods on Hybrid Tetrahedral Grids

Fabian Böhm; Daniel Bauer; Nils Kohl; Christie Alappat; Dominik Thönnes; Marcus Mohr; Harald Köstler; Ulrich Rüde

Code Generation and Performance Engineering for Matrix-Free Finite Element Methods on Hybrid Tetrahedral Grids

Fabian Böhm, Daniel Bauer, Nils Kohl, Christie Alappat, Dominik Thönnes, Marcus Mohr, Harald Köstler, Ulrich Rüde

TL;DR

This work introduces the HyTeG Operator Generator (HOG), a code-generation pipeline that produces matrix-free finite element kernels tailored to block-structured, hybrid tetrahedral grids. By transforming weak forms into optimized ASTs and applying techniques such as inter-element vectorization, loop invariant removal, and tabulation of weak-form factors, HOG delivers substantial speed-ups (up to 58×) and high node-level throughput (up to 2.1 GDoF/s, reaching ~62% of peak on Intel Ice Lake). The approach is demonstrated across three operators, including curl-curl and variable-coefficient diffusion, and is integrated with a full geometric multigrid solver to achieve extreme-scale performance (>$10^{12}$ DoFs on 21,504 processes in under 50 seconds). The work highlights the value of performance engineering tools like roofline analysis for guiding automated kernel generation and paves the way for future extensions to curvilinear geometries, DG methods, and accelerator backends.

Abstract

This paper introduces a code generator designed for node-level optimized, extreme-scalable, matrix-free finite element operators on hybrid tetrahedral grids. It optimizes the local evaluation of bilinear forms through various techniques including tabulation, relocation of loop invariants, and inter-element vectorization - implemented as transformations of an abstract syntax tree. A key contribution is the development, analysis, and generation of efficient loop patterns that leverage the local structure of the underlying tetrahedral grid. These significantly enhance cache locality and arithmetic intensity, mitigating bandwidth-pressure associated with compute-sparse, low-order operators. The paper demonstrates the generator's capabilities through a comprehensive educational cycle of performance analysis, bottleneck identification, and emission of dedicated optimizations. For three differential operators ($-Δ$, $-\nabla \cdot (k(\mathbf{x})\, \nabla\,)$, $α(\mathbf{x})\, \mathbf{curl}\ \mathbf{curl} + β(\mathbf{x}) $), we determine the set of most effective optimizations. Applied by the generator, they result in speed-ups of up to 58$\times$ compared to reference implementations. Detailed node-level performance analysis yields matrix-free operators with a throughput of 1.3 to 2.1 GDoF/s, achieving up to 62% peak performance on a 36-core Intel Ice Lake socket. Finally, the solution of the curl-curl problem with more than a trillion ($ 10^{12}$) degrees of freedom on 21504 processes in less than 50 seconds demonstrates the generated operators' performance and extreme-scalability as part of a full multigrid solver.

Code Generation and Performance Engineering for Matrix-Free Finite Element Methods on Hybrid Tetrahedral Grids

TL;DR

DoFs on 21,504 processes in under 50 seconds). The work highlights the value of performance engineering tools like roofline analysis for guiding automated kernel generation and paves the way for future extensions to curvilinear geometries, DG methods, and accelerator backends.

Abstract

), we determine the set of most effective optimizations. Applied by the generator, they result in speed-ups of up to 58

compared to reference implementations. Detailed node-level performance analysis yields matrix-free operators with a throughput of 1.3 to 2.1 GDoF/s, achieving up to 62% peak performance on a 36-core Intel Ice Lake socket. Finally, the solution of the curl-curl problem with more than a trillion (

) degrees of freedom on 21504 processes in less than 50 seconds demonstrates the generated operators' performance and extreme-scalability as part of a full multigrid solver.

Paper Structure (30 sections, 8 equations, 11 figures, 2 tables, 3 algorithms)

This paper contains 30 sections, 8 equations, 11 figures, 2 tables, 3 algorithms.

Introduction
Contribution
Related Work
Structure
Hybrid Tetrahedral Grids
Code Generation for FEM on Hybrid Tetrahedral Grids
Matrix-Free Finite Elements
Optimizing Cache-Locality: Loop Strategies
Sawtooth Loop Strategy
Cubes Loop Strategy
Memory Volume and Layer Conditions
Memory Study
Optimizing Computations
Automatic Identification of Loop Invariants
Inter-element Vectorization
...and 15 more sections

Figures (11)

Figure 1: Performance of generated matrix-free matrix-vector multiplication kernels for a variable-coefficient diffusion operator $-\nabla \cdot (k(\mathbf{x})\, \nabla\,)$ discretized with quadratic conforming elements after application of individual performance optimizations. The final operator exhibits an accumulated speed-up of 58$\times$. Detailed discussion in \ref{['sec:PerfEng']}.
Figure 2: Six types of micro-elements with different orientations in space, illustrated on refinement level $\ell = 2$. Each group is denoted by a unique symbol to streamline the presentation of algorithms. Combined, they constitute the complete macro tetrahedron. Micros of a certain orientation are, as visible, translation invariant. Depending on the orientation, the macro-tetrahedron fits a differing number of micro-elements, e.g., four -elements against only two -elements in the longest row. See Kohl:2023:FundamentalDataStructures for details.
Figure 3: Code generation pipeline of the HOG. A scalable, matrix-free operator for hybrid tetrahedral grids is generated and optimized from simple input parameters. The resulting kernel is embedded into the HyTeGFE framework Kohl:2019:HyTeGFiniteelementSoftwareKohl:2023:FundamentalDataStructures.
Figure 4: Left: Micro-elements of the - (purple to yellow gradient) and -type (in red) share vertex and edge DoFs. The DoFs are loaded from main memory during the -element loop, which then iterates the remaining macro-tetrahedron. The overlapping DoFs are evicted from cache in the process for practically relevant refinement levels. The following -element-loop iterates over the red micro-elements and has to reload the overlapping DoFs from main memory. Middle: The 6 element loops of the sawtooth loop strategy are fused together. Red micro elements lie outside the macro-tetrahedron and must be omitted through conditionals and by splitting the iteration space into complete and incomplete iterations. Right: The vertex-DoF in the lower front-right corner is part of multiple micro tetrahedra and will be loaded only once from main memory with the cubes loop strategy.
Figure 5: Categorization of vertex and edge DoFs on the current iteration (red) of a sawtooth (left) and cubes loop (right): light green DoFs are always cached from the previous iteration (light green elements), red DoFs are never cached because they lie ahead of the iteration order, teal and blue DoFs are cached depending on the presence of data from the corresponding neighbor elements in cache.
...and 6 more figures

Theorems & Definitions (3)

Remark 1
Remark 2
Remark 3

Code Generation and Performance Engineering for Matrix-Free Finite Element Methods on Hybrid Tetrahedral Grids

TL;DR

Abstract

Code Generation and Performance Engineering for Matrix-Free Finite Element Methods on Hybrid Tetrahedral Grids

Authors

TL;DR

Abstract

Table of Contents

Figures (11)

Theorems & Definitions (3)