Performance evaluation of accelerated complex multiple-precision LU decomposition

Tomonori Kouya

TL;DR

This work tackles the performance bottlenecks of complex LU decomposition in multiple-precision arithmetic by combining the $3M$ complex multiplication method, AVX2 SIMD acceleration, and OpenMP parallelization within a multi-component $DD$, $TD$, and $QD$ precision framework built on MPLAPACK/MPBLAS. It systematically benchmarks complex matrix multiplication using several algorithms (simple, block, Strassen, Ozaki) and shows that, across precisions, the AVX2 SIMDized normal LU decomposition is generally the fastest, with reported gains over MPLAPACK of up to approximately $726\times$ for $DD$ and around $91\times$ for $QD$ when using many threads. The results reveal clear trade-offs: Strassen and Ozaki can offer benefits in specific regimes, but OpenMP-accelerated block multiplication can dominate in parallel settings. The study informs the choice of algorithm for dense high-precision linear solvers according to the target precision and the available parallelism, which matters for scientific computing workflows that need reliable multiple-precision solutions.
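To make the multi-component idea concrete, the following is a minimal sketch of $DD$ (double-double) addition, in which a value is held as an unevaluated sum of two doubles `(hi, lo)`; $TD$ and $QD$ extend the same idea to three and four components. The function names are hypothetical and chosen for illustration only. This is the well-known "sloppy" addition built on an error-free two-sum, written in plain Python; it is not the paper's AVX2 implementation.

```python
def two_sum(a, b):
    """Error-free addition: return (s, e) with s = fl(a + b) and s + e = a + b exactly."""
    s = a + b
    bb = s - a
    e = (a - (s - bb)) + (b - bb)
    return s, e


def quick_two_sum(a, b):
    """Error-free addition that assumes |a| >= |b| (saves three operations)."""
    s = a + b
    e = b - (s - a)
    return s, e


def dd_add(x, y):
    """Add two double-double values, each given as a (hi, lo) tuple.

    Hypothetical helper for illustration: the high parts are added exactly,
    the low parts are folded into the rounding error, and the pair is
    renormalized so that hi carries the leading bits.
    """
    s, e = two_sum(x[0], y[0])
    e += x[1] + y[1]
    return quick_two_sum(s, e)


if __name__ == "__main__":
    # 1 + 2**-80 cannot be held in one double, but a DD pair keeps both parts.
    print(dd_add((1.0, 0.0), (2.0 ** -80, 0.0)))
```

Each $DD$ operation expands into a fixed sequence of double-precision operations like these, which is why such arithmetic is a natural candidate for SIMD vectorization.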

Abstract

The direct method is one of the most important algorithms for solving linear systems of equations, with LU decomposition comprising a significant portion of its computation time. This study explores strategies to accelerate complex LU decomposition using multiple-precision floating-point arithmetic of the multiple-component type. Specifically, we explore the potential efficiency gains using a combination of SIMDization and the 3M method for complex matrix multiplication. Our benchmark tests compare this approach with the direct method implementation in MPLAPACK, focusing on computation time and numerical errors.
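As an illustration of the 3M method mentioned above, the sketch below multiplies two complex matrices stored as separate real and imaginary parts using three real matrix products instead of the usual four. With $A = A_r + iA_i$ and $B = B_r + iB_i$, it forms $T_1 = A_rB_r$, $T_2 = A_iB_i$, $T_3 = (A_r + A_i)(B_r + B_i)$, and then $C_r = T_1 - T_2$, $C_i = T_3 - T_1 - T_2$. All function names are hypothetical, and ordinary Python floats stand in for the $DD$, $TD$, and $QD$ types; this is a sketch under those assumptions, not the paper's code.

```python
def matmul(a, b):
    """Plain real matrix product of nested lists (rows of a times columns of b)."""
    n, k, m = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]


def matadd(a, b):
    """Elementwise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def matsub(a, b):
    """Elementwise difference of two equally shaped matrices."""
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def cmatmul_3m(ar, ai, br, bi):
    """Complex matrix product via the 3M method.

    Inputs are the real and imaginary parts of A and B; the result is the
    pair (Cr, Ci). Only three real matrix products are computed.
    """
    t1 = matmul(ar, br)                          # Ar * Br
    t2 = matmul(ai, bi)                          # Ai * Bi
    t3 = matmul(matadd(ar, ai), matadd(br, bi))  # (Ar + Ai) * (Br + Bi)
    cr = matsub(t1, t2)                          # real part: T1 - T2
    ci = matsub(matsub(t3, t1), t2)              # imaginary part: T3 - T1 - T2
    return cr, ci


if __name__ == "__main__":
    ar, ai = [[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]
    br, bi = [[1.0, 0.0], [2.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]
    print(cmatmul_3m(ar, ai, br, bi))
```

The trade-off is one fewer real matrix product at the cost of extra matrix additions, which pays off when each multiplication is expensive, as in multiple-precision arithmetic. The imaginary part is obtained by subtraction, so its rounding behavior can differ from the four-product formula, which is one reason to measure numerical errors alongside computation time.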

Paper Structure (14 sections, 4 equations, 17 figures, 1 table)


Figures (17)

  • Figure 1: Data structure for multiple-precision complex vectors $\mathbf{v}\in\mathbb{C}^3$ and matrices $A\in\mathbb{C}^{3\times 3}$
  • Figure 2: Complex linear computation with AVX2
  • Figure 3: The parallelized Strassen algorithm
  • Figure 4: The parallelized Ozaki scheme
  • Figure 5: DD-prec.: Computation time in seconds (left) and speed-up ratio when using 32 threads (right)
  • ...and 12 more figures