Skew-Symmetric Matrix Decompositions on Shared-Memory Architectures
Ishna Satyarth, Chao Yin, Devin A. Matthews, Maggie Myers, Robert van de Geijn, RuQing G. Xu
TL;DR
The paper addresses the efficient factorization of skew-symmetric matrices into the $LTL^\mathrm{T}$ form on shared-memory architectures. It uses the FLAME methodology to systematically derive a family of unblocked and blocked algorithms, including fused right-looking and left-looking variants, with and without pivoting. The work introduces new level-2 and level-3 BLAS-like operations, demonstrates their implementation in a prototype C++ FLAME-like API, and shows substantial performance improvements over the prior PFAPACK/Pfaffine implementations, as well as performance competitive with symmetric factorizations. These results highlight the practical viability of high-performance skew-symmetric factorizations and lay the groundwork for broader FLAME-based algorithm design, including pivoting and tensor extensions.
Abstract
The factorization of skew-symmetric matrices is a critically understudied area of dense linear algebra, particularly in comparison to that of general and symmetric matrices. While some algorithms can be adapted from the symmetric case, their cost can be reduced by exploiting skew-symmetry. This work examines the factorization of a skew-symmetric matrix $X$ into its $LTL^\mathrm{T}$ decomposition, where $L$ is unit lower triangular and $T$ is tridiagonal. This is also known as a triangular tridiagonalization. The operation provides a means for computing the determinant of $X$ as the square of the (cheaply computed) Pfaffian of the skew-symmetric tridiagonal matrix $T$, as well as for solving systems of equations, in fields such as quantum electronic structure and machine learning. In practice, it often requires pivoting in order to improve numerical stability. We compare and contrast previously published algorithms with those systematically derived using the FLAME methodology. Performant parallel CPU implementations are achieved by fusing operations at multiple levels in order to reduce memory traffic overhead. A key factor is the employment of new capabilities of the BLAS-like Library Instantiation Software (BLIS) framework, which now supports casting level-2 and level-3 BLAS-like operations by leveraging its gemm and other kernels, hierarchical parallelism, and cache blocking. A concise prototype C++ API facilitates the translation of correct-by-construction algorithms into correct code. Experiments verify that the resulting implementations greatly exceed the performance of previous work.
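To make the link between the $LTL^\mathrm{T}$ decomposition and the determinant concrete, the following is a minimal C++ sketch, not the paper's implementation. It assumes a small dense matrix stored as nested vectors and uses a plain unblocked, unpivoted elimination (in the style of Parlett and Reid) rather than any of the fused or blocked variants discussed in the paper. The names `Matrix`, `ltlt_unpivoted`, and `pfaffian_tridiagonal` are hypothetical and introduced only for illustration. Because $L$ is unit lower triangular, $\det(X) = \det(T)$, and for an even-order skew-symmetric tridiagonal $T$ the Pfaffian is the product of every other superdiagonal entry.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical dense storage used only for this sketch.
using Matrix = std::vector<std::vector<double>>;

// Illustrative, unpivoted triangular tridiagonalization (an assumption of this
// sketch, not the paper's algorithm). On success, A is overwritten with the
// skew-symmetric tridiagonal T and L is unit lower triangular, so that the
// original matrix equals L * T * L^T. Returns false if a zero pivot is met,
// which is the situation where pivoting would be required.
bool ltlt_unpivoted(Matrix& A, Matrix& L) {
    const std::size_t n = A.size();
    L.assign(n, std::vector<double>(n, 0.0));
    for (std::size_t i = 0; i < n; ++i) L[i][i] = 1.0;

    for (std::size_t k = 0; k + 2 < n; ++k) {
        const double pivot = A[k + 1][k];
        if (pivot == 0.0) return false;
        for (std::size_t i = k + 2; i < n; ++i) {
            const double l = A[i][k] / pivot;
            // Congruence update: the same elimination is applied to row i
            // and to column i, which preserves skew-symmetry.
            for (std::size_t j = 0; j < n; ++j) A[i][j] -= l * A[k + 1][j];
            for (std::size_t j = 0; j < n; ++j) A[j][i] -= l * A[j][k + 1];
            L[i][k + 1] = l;  // multiplier stored one column to the right
        }
    }
    return true;
}

// Pfaffian of a skew-symmetric tridiagonal matrix: zero for odd order,
// otherwise the product T(0,1) * T(2,3) * T(4,5) * ...
double pfaffian_tridiagonal(const Matrix& T) {
    const std::size_t n = T.size();
    if (n % 2 != 0) return 0.0;
    double pf = 1.0;
    for (std::size_t i = 0; i + 1 < n; i += 2) pf *= T[i][i + 1];
    return pf;
}
```

As a check, take the $4 \times 4$ skew-symmetric matrix whose strictly upper triangular entries are $x_{12}=1$, $x_{13}=2$, $x_{14}=3$, $x_{23}=4$, $x_{24}=5$, $x_{34}=6$. Its Pfaffian is $x_{12}x_{34} - x_{13}x_{24} + x_{14}x_{23} = 8$, so $\det(X) = 64$. The sketch reduces this matrix to a tridiagonal $T$ with superdiagonal $(1, 4, 8)$, giving $\mathrm{Pf}(T) = 1 \cdot 8 = 8$.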
