Table of Contents
Fetching ...

Skew-Symmetric Matrix Decompositions on Shared-Memory Architectures

Ishna Satyarth, Chao Yin, Devin A. Matthews, Maggie Myers, Robert van de Geijn, RuQing G. Xu

TL;DR

The paper tackles the challenge of efficiently factorizing skew-symmetric matrices via the $L T L^T$ form on shared-memory architectures. It employs the FLAME methodology to systematically derive a family of unblocked and blocked algorithms, including fused right-looking and left-looking variants, with and without pivoting. The work introduces new level-2 and level-3 BLAS-like operations, demonstrates their implementation in a prototype C++ FLAME-like API, and shows substantial performance improvements over prior PFAPACK/Pfaffine implementations as well as competitive results with symmetric factorizations. These results highlight the practical viability of high-performance skew-symmetric factorizations and lay groundwork for broader FLAME-based algorithm design including pivoting and tensor extensions.

Abstract

The factorization of skew-symmetric matrices is a critically understudied area of dense linear algebra, particularly in comparison to that of general and symmetric matrices. While some algorithms can be adapted from the symmetric case, the cost of algorithms can be reduced by exploiting skew-symmetry. This work examines the factorization of a skew-symmetric matrix $X$ into its $LTL^\mathrm{T}$ decomposition, where $L$ is unit lower triangular and $T$ is tridiagonal. This is also known as a triangular tridiagonalization. This operation is a means for computing the determinant of $X$ as the square of the (cheaply-computed) Pfaffian of the skew-symmetric tridiagonal matrix $T$ as well as for solving systems of equations, across fields such as quantum electronic structure and machine learning. Its application also often requires pivoting in order to improve numerical stability. We compare and contrast previously-published algorithms with those systematically derived using the FLAME methodology. Performant parallel CPU implementations are achieved by fusing operations at multiple levels in order to reduce memory traffic overhead. A key factor is the employment of new capabilities of the BLAS-like Library Instantiation Software (BLIS) framework, which now supports casting level-2 and level-3 BLAS-like operations by leveraging its gemm and other kernels, hierarchical parallelism, and cache blocking. A prototype, concise C++ API facilitates the translation of correct-by-construction algorithms into correct code. Experiments verify that the resulting implementations greatly exceed the performance of previous work.

Skew-Symmetric Matrix Decompositions on Shared-Memory Architectures

TL;DR

The paper tackles the challenge of efficiently factorizing skew-symmetric matrices via the form on shared-memory architectures. It employs the FLAME methodology to systematically derive a family of unblocked and blocked algorithms, including fused right-looking and left-looking variants, with and without pivoting. The work introduces new level-2 and level-3 BLAS-like operations, demonstrates their implementation in a prototype C++ FLAME-like API, and shows substantial performance improvements over prior PFAPACK/Pfaffine implementations as well as competitive results with symmetric factorizations. These results highlight the practical viability of high-performance skew-symmetric factorizations and lay groundwork for broader FLAME-based algorithm design including pivoting and tensor extensions.

Abstract

The factorization of skew-symmetric matrices is a critically understudied area of dense linear algebra, particularly in comparison to that of general and symmetric matrices. While some algorithms can be adapted from the symmetric case, the cost of algorithms can be reduced by exploiting skew-symmetry. This work examines the factorization of a skew-symmetric matrix into its decomposition, where is unit lower triangular and is tridiagonal. This is also known as a triangular tridiagonalization. This operation is a means for computing the determinant of as the square of the (cheaply-computed) Pfaffian of the skew-symmetric tridiagonal matrix as well as for solving systems of equations, across fields such as quantum electronic structure and machine learning. Its application also often requires pivoting in order to improve numerical stability. We compare and contrast previously-published algorithms with those systematically derived using the FLAME methodology. Performant parallel CPU implementations are achieved by fusing operations at multiple levels in order to reduce memory traffic overhead. A key factor is the employment of new capabilities of the BLAS-like Library Instantiation Software (BLIS) framework, which now supports casting level-2 and level-3 BLAS-like operations by leveraging its gemm and other kernels, hierarchical parallelism, and cache blocking. A prototype, concise C++ API facilitates the translation of correct-by-construction algorithms into correct code. Experiments verify that the resulting implementations greatly exceed the performance of previous work.

Paper Structure

This paper contains 48 sections, 5 theorems, 19 equations, 11 figures.

Key Result

Theorem 2.2

Let matrix $X \in \mathbb{R}^{m \times m}$ be partitioned as $X = \left( \right)$, where $X_{TL}$ is square. Then $X$ is skew-symmetric iff

Figures (11)

  • Figure 1: The unblocked right-looking (modified Parlett-Reid) and left-looking (modified Aasen) algorithms. Portions in red are only required when pivoting is desired. See Section \ref{['sec:pivoting']} for definitions related to pivoting.
  • Figure 1: The FLAME methodology workflow.
  • Figure 1: C++ implementation of the blocked right-looking algorithm of Figure \ref{['fig:LTLt_blk']}. The use of ranges and range partitioning allows for strongly-typed identification of sub-matrices, vectors, and scalars. The tridiagonal factor $T$ is stored as an external vector t containing the sub-diagonal elements, while the triangular factor $L$ is stored in the lower part of $X$ with an implicit column shift. The diagonal elements of $L$ are stored as explicit "1"s in order to enable combined updates such as the skew_tridiag_rankk which involves $\lambda_{33}=1$.
  • Figure 1: Performance of blocked right-looking algorithms with the unblocked left-looking algorithm applied within blocks. Fusion and other optimizations are applied in steps (see text). ace) without pivoting; bdf) with pivoting. The algorithmic block size is ab) 128, cd) 192, ef) 256, and hi) 512. All calculations use 64 cores (1 socket). Notice the that the y-axis scaling is different for the graphs on the right.
  • Figure 2: Two-step right-looking unblocked algorithm.
  • ...and 6 more figures

Theorems & Definitions (10)

  • Definition 2.1
  • Theorem 2.2
  • Theorem 2.3
  • Proof 1
  • Theorem 2.4
  • Definition 2.5
  • Lemma 2.6
  • Theorem 2.7
  • Definition 3.1
  • Definition 3.2