Exploiting nested task-parallelism in the $\mathcal{H}$-LU factorization

Rocío Carratalá-Sáez, Sven Christophersen, José I. Aliaga, Vicenç Beltran, Steffen Börm, Enrique S. Quintana-Ortí

TL;DR

This work parallelizes the LU factorization of $\mathcal{H}$-matrices arising in boundary element methods by exploiting dynamic task-parallelism with the OmpSs-2 runtime. It introduces a skeleton data structure that decouples dependency analysis from the dynamic, non-contiguous $\mathcal{H}$-matrix data layout, and it uses the weak dependencies and early release of dependencies offered by OmpSs-2 to expose nested, fine-grained parallelism. On a 24-core Intel Xeon system, the approach outperforms kernel-level and coarse-grained parallelization strategies, especially when low-rank blocks are present. The results are relevant to high-performance hierarchical linear algebra, as they enable scalable factorization of large, data-sparse $\mathcal{H}$-matrices in boundary integral applications.

Abstract

We address the parallelization of the LU factorization of hierarchical matrices ($\mathcal{H}$-matrices) arising from boundary element methods. Our approach exploits task-parallelism via the OmpSs programming model and runtime, which discovers the data-flow parallelism intrinsic to the operation at execution time, via the analysis of data dependencies based on the memory addresses of the tasks' operands. This is especially challenging for $\mathcal{H}$-matrices, as the structures containing the data vary in dimension during the execution. We tackle this issue by decoupling the data structure from that used to detect dependencies. Furthermore, we leverage the support for weak operands and early release of dependencies, recently introduced in OmpSs-2, to accelerate the execution of parallel codes with nested task-parallelism and fine-grain tasks.
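The decoupling idea in the abstract can be illustrated with a small Python sketch. It is not the authors' implementation and does not use the OmpSs-2 API: `SkeletonNode`, `DependencyTracker`, and `build_lu_graph` are hypothetical names, and the toy tracker only stands in for a runtime that derives data-flow dependencies from operand identity. The assumption is that each block gets a fixed handle (the skeleton node) on which dependencies are declared, so the block's payload can be reallocated with a different size without confusing the dependency analysis.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)  # eq=False keeps identity-based hashing, like a fixed address
class SkeletonNode:
    """Hypothetical stable handle for one matrix block."""
    name: str
    payload: list = field(default_factory=list)  # stand-in for resizable block data


class DependencyTracker:
    """Toy data-flow analysis keyed on skeleton nodes, not on payload buffers."""

    def __init__(self):
        self.tasks = []        # task names, indexed by submission order
        self.edges = set()     # (producer, consumer) task index pairs
        self.last_writer = {}  # node -> index of the last task that wrote it
        self.readers = {}      # node -> tasks that read it since the last write

    def submit(self, name, reads=(), writes=()):
        t = len(self.tasks)
        self.tasks.append(name)
        for node in reads:
            if node in self.last_writer:          # read-after-write
                self.edges.add((self.last_writer[node], t))
            self.readers.setdefault(node, []).append(t)
        for node in writes:
            if node in self.last_writer:          # write-after-write
                self.edges.add((self.last_writer[node], t))
            for r in self.readers.get(node, []):  # write-after-read
                if r != t:
                    self.edges.add((r, t))
            self.last_writer[node] = t
            self.readers[node] = []
        return t


def build_lu_graph():
    """Submit the tasks of a blocked right-looking LU on a 2x2 block matrix."""
    a00, a01, a10, a11 = (SkeletonNode(n) for n in ("A00", "A01", "A10", "A11"))
    tr = DependencyTracker()
    tr.submit("getrf(A00)", writes=[a00])
    tr.submit("trsm(A01)", reads=[a00], writes=[a01])
    # Assumed scenario: the block's storage is replaced by a buffer of another
    # size (e.g. a rank change); the skeleton node itself stays the same object.
    a01.payload = [0.0] * 3
    tr.submit("trsm(A10)", reads=[a00], writes=[a10])
    tr.submit("gemm(A11)", reads=[a10, a01], writes=[a11])
    tr.submit("getrf(A11)", writes=[a11])
    return tr


if __name__ == "__main__":
    tracker = build_lu_graph()
    for src, dst in sorted(tracker.edges):
        print(f"{tracker.tasks[src]} -> {tracker.tasks[dst]}")
```

In this sketch the update of `A11` still depends on the solve that produced `A01`, even though the payload of `A01` was replaced in between, because the dependency is attached to the skeleton node rather than to the buffer. Weak dependencies and early release are runtime features and are not modeled here.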

Paper Structure

This paper contains 21 sections, 16 equations, 10 figures, 1 table, 2 algorithms.

Figures (10)

  • Figure 1: $2\times 2$ partitioning of a simple $\mathcal{H}$-matrix.
  • Figure 2: Data dependencies in the blocked RL algorithm for the $\mathcal{H}$-LU factorization.
  • Figure 3: Alternative $2\times 2$ partitioning of a simple $\mathcal{H}$-matrix.
  • Figure 4: Data dependencies between tasks O1 and O2 of the blocked RL algorithm for the $\mathcal{H}$-LU factorization. The black solid lines specify "internal" strong dependencies; the pink solid lines, strong dependencies crossing task boundaries; and the blue dashed line, the weak dependency.
  • Figure 5: Performance of the matrix-matrix multiplication routine in Intel MKL.
  • ...and 5 more figures