Performance evaluation of accelerated complex multiple-precision LU decomposition

Tomonori Kouya

TL;DR

This work tackles the performance bottlenecks of complex LU decomposition in multiple-precision arithmetic by combining the $3M$ complex multiplication method, AVX2 SIMD acceleration, and OpenMP parallelization within a multi-component $DD$, $TD$, and $QD$ precision framework built on MPLAPACK/MPBLAS. It systematically benchmarks complex matrix multiplication with four algorithms (simple, block, Strassen, Ozaki) and shows that, across precisions, the AVX2 SIMDized normal LU decomposition is generally the fastest, with speedups over MPLAPACK of up to approximately $726\times$ for $DD$ and around $91\times$ for $QD$ when using many threads. The results reveal clear trade-offs: Strassen and Ozaki can offer benefits in specific regimes, but OpenMP-accelerated block multiplication can dominate in parallel settings. The study advances reproducible, high-precision linear solvers for dense systems and informs the choice of algorithm according to precision and available parallelism, with practical impact on scientific computing workflows that require reliable multiple-precision solutions.

Abstract

The direct method is one of the most important algorithms for solving linear systems of equations, and LU decomposition accounts for a significant portion of its computation time. This study explores strategies to accelerate complex LU decomposition using multiple-precision floating-point arithmetic of the multiple-component type. Specifically, we examine the efficiency gains obtainable by combining SIMDization with the 3M method for complex matrix multiplication. Our benchmark tests compare this approach with the direct method implementation in MPLAPACK, focusing on computation time and numerical errors.
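As a rough illustration of the 3M method mentioned above, the sketch below forms a complex matrix product from three real matrix products instead of the usual four. It is a minimal sketch and not the paper's implementation: plain Python numbers stand in for the DD, TD, and QD multiple-component types, and the helper names (`matmul_real`, `matmul_4m`, `matmul_3m`) are hypothetical.

```python
# Minimal sketch of the 3M method for complex matrix multiplication.
# Assumption: matrices are lists of lists, with real and imaginary parts
# stored separately; ordinary Python numbers stand in for multiple-precision values.

def matmul_real(a, b):
    """Plain triple-loop real matrix product."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(inner)) for j in range(cols)]
            for i in range(rows)]

def mat_add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def mat_sub(a, b):
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def matmul_4m(ar, ai, br, bi):
    """Conventional product: four real matrix multiplications."""
    cr = mat_sub(matmul_real(ar, br), matmul_real(ai, bi))
    ci = mat_add(matmul_real(ar, bi), matmul_real(ai, br))
    return cr, ci

def matmul_3m(ar, ai, br, bi):
    """3M product: three real matrix multiplications plus extra additions."""
    t1 = matmul_real(ar, br)                             # Re(A) Re(B)
    t2 = matmul_real(ai, bi)                             # Im(A) Im(B)
    t3 = matmul_real(mat_add(ar, ai), mat_add(br, bi))   # (Re A + Im A)(Re B + Im B)
    cr = mat_sub(t1, t2)                                 # real part: t1 - t2
    ci = mat_sub(mat_sub(t3, t1), t2)                    # imaginary part: t3 - t1 - t2
    return cr, ci

# Small integer example so both variants agree exactly.
ar, ai = [[1, 2], [3, 4]], [[5, 6], [7, 8]]
br, bi = [[1, 0], [2, 1]], [[0, 1], [1, 0]]
cr3, ci3 = matmul_3m(ar, ai, br, bi)
cr4, ci4 = matmul_4m(ar, ai, br, bi)
```

The trade-off is one fewer multiplication in exchange for several extra additions, which tends to pay off when multiplication is much more expensive than addition, as with multiple-component arithmetic. In floating-point arithmetic the imaginary part from the 3M form can carry a different rounding error than the conventional form, which is one reason to compare numerical errors as well as computation time.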

Paper Structure

This paper contains 14 sections, 4 equations, 17 figures, and 1 table.

Introduction
Acceleration of complex basic linear computation
The 3M method for complex linear computation
Acceleration using AVX2 SIMDization
Parallelization with OpenMP
Benchmark test for complex matrix multiplication
Acceleration of multiple-precision complex LU decomposition
Benchmark test of complex LU decomposition
DD-precision computation
QD-precision computing
TD-precision computing
Summary of high-performance complex LU decomposition
Conclusion and future works
Acknowledgment

Figures (17)

Figure 1: Data structure for multiple-precision complex vectors $\mathbf{v}\in\mathbb{C}^3$ and matrices $A\in\mathbb{C}^{3\times 3}$
Figure 2: Complex linear computation with AVX2
Figure 3: The parallelized Strassen algorithm
Figure 4: The parallelized Ozaki scheme
Figure 5: DD-prec.: Computation time in seconds (left) and speed-up ratio when using 32 threads (right)
...and 12 more figures
