Efficiently Parallelizable Strassen-Based Multiplication of a Matrix by its Transpose

Viviana Arrigoni, Filippo Maggioli, Annalisa Massini, Emanuele Rodolà

TL;DR

This paper proposes a new cache-oblivious algorithm (AtA) for computing the product $A^T A$ of a matrix by its transpose, using the classical Strassen algorithm as a sub-routine. It decreases the computational cost to $\frac{2}{3}$ of the time required by Strassen's algorithm, amounting to $\frac{14}{3}n^{\log_2 7}$ floating point operations.

Abstract

The multiplication of a matrix by its transpose, $A^T A$, appears as an intermediate operation in the solution of a wide set of problems. In this paper, we propose a new cache-oblivious algorithm (AtA) for computing this product, based upon the classical Strassen algorithm as a sub-routine. In particular, we decrease the computational cost to $\frac{2}{3}$ of the time required by Strassen's algorithm, amounting to $\frac{14}{3}n^{\log_2 7}$ floating point operations. AtA works for generic rectangular matrices, and exploits the symmetry of the resulting product matrix to save memory. In addition, we provide an extensive implementation study of AtA in a shared memory system, and extend its applicability to a distributed environment. To support our findings, we compare our algorithm with state-of-the-art solutions specialized in the computation of $A^T A$. Our experiments highlight good scalability with respect to both the matrix size and the number of involved processes, as well as favorable performance for both parallel paradigms and the sequential implementation, when compared with other methods in the literature.
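To make the idea in the abstract concrete, here is a minimal sketch of a Strassen-based $A^T A$ recursion. It is not the authors' implementation: the block decomposition (four recursive calls on the diagonal blocks plus two general Strassen products for the off-diagonal block) is an assumption about the structure, the function names `ata` and `strassen` are hypothetical, and the code is restricted to square matrices whose size is a power of two, whereas the paper handles generic rectangular matrices.

```python
def transpose(a):
    return [list(col) for col in zip(*a)]

def add(x, y):
    return [[p + q for p, q in zip(r, s)] for r, s in zip(x, y)]

def sub(x, y):
    return [[p - q for p, q in zip(r, s)] for r, s in zip(x, y)]

def split(a):
    """Split a square matrix into its four half-size quadrants."""
    h = len(a) // 2
    return ([r[:h] for r in a[:h]], [r[h:] for r in a[:h]],
            [r[:h] for r in a[h:]], [r[h:] for r in a[h:]])

def join(c11, c12, c21, c22):
    """Reassemble four quadrants into one matrix."""
    top = [l + r for l, r in zip(c11, c12)]
    bottom = [l + r for l, r in zip(c21, c22)]
    return top + bottom

def strassen(x, y):
    """Classical Strassen product (7 recursive multiplications per level)."""
    if len(x) == 1:
        return [[x[0][0] * y[0][0]]]
    a, b, c, d = split(x)
    e, f, g, h = split(y)
    m1 = strassen(add(a, d), add(e, h))
    m2 = strassen(add(c, d), e)
    m3 = strassen(a, sub(f, h))
    m4 = strassen(d, sub(g, e))
    m5 = strassen(add(a, b), h)
    m6 = strassen(sub(c, a), add(e, f))
    m7 = strassen(sub(b, d), add(g, h))
    c11 = add(sub(add(m1, m4), m5), m7)
    c12 = add(m3, m5)
    c21 = add(m2, m4)
    c22 = add(add(sub(m1, m2), m3), m6)
    return join(c11, c12, c21, c22)

def ata(a):
    """Sketch of A^T A for a square, power-of-two-sized matrix (assumed structure)."""
    if len(a) == 1:
        return [[a[0][0] * a[0][0]]]
    a11, a12, a21, a22 = split(a)
    # Diagonal blocks are themselves products of the form X^T X: recurse.
    c11 = add(ata(a11), ata(a21))
    c22 = add(ata(a12), ata(a22))
    # The off-diagonal block needs two general products: use Strassen.
    c12 = add(strassen(transpose(a11), a12), strassen(transpose(a21), a22))
    # By symmetry the lower-left block is the transpose of c12; no extra product.
    return join(c11, c12, transpose(c12), c22)

print(ata([[1, 2], [3, 4]]))  # [[10, 14], [14, 20]]
```

Under this assumed structure the flop count satisfies $T(n) = 4T(n/2) + 2S(n/2) + O(n^2)$, where $S(n) \approx 7n^{\log_2 7}$ is the cost of Strassen's algorithm. Solving for the leading term gives $\frac{14}{3}n^{\log_2 7}$, which agrees with the figure quoted in the abstract. For clarity the sketch materializes the full symmetric result, while the paper exploits the symmetry to save memory.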

Paper Structure

This paper contains 25 sections, 3 theorems, 10 equations, 6 figures, 1 table, 3 algorithms.

Key Result

Proposition 3.1

The cache complexity of AtA, $C_{\text{AtA}}(n; M, b)$, is the same as the cache complexity of Strassen, $C_S(n; M, b) = \Theta\left(1 + \frac{n^2}{b} + \frac{n^{\log_2 7}}{b\sqrt{M}}\right)$ [Frigo et al., 1999].
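As a rough illustration of how the three terms in this bound compare, the sketch below evaluates each of them for given $n$, $M$ and $b$. The function name `cache_bound_terms` is hypothetical, the interpretation of $M$ as cache size and $b$ as cache-line length (both in matrix elements) is an assumption, and the values ignore the constants hidden by $\Theta$, so only their relative growth is meaningful.

```python
import math

def cache_bound_terms(n, M, b):
    """Return the three terms inside Theta(1 + n^2/b + n^{log2 7} / (b*sqrt(M))).

    n: matrix dimension; M: cache size; b: cache-line length
    (units assumed to be matrix elements). Constants are ignored.
    """
    scan_term = n * n / b
    recursion_term = n ** math.log2(7) / (b * math.sqrt(M))
    return 1.0, scan_term, recursion_term

# Hypothetical parameters: cache of 2**16 elements, lines of 8 elements.
for n in (64, 1024, 8192):
    _, scan, rec = cache_bound_terms(n, M=2**16, b=8)
    dominant = "n^2/b" if scan >= rec else "n^{log2 7}/(b*sqrt(M))"
    print(n, dominant)
```

The last term overtakes $n^2/b$ once $n^{\log_2 7 - 2}$ exceeds $\sqrt{M}$, so for matrices much larger than the cache the recursive multiplications dominate the cache misses.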

Figures (6)

  • Figure 1: A tree of 16 processes distributing $A \in \mathbb{R}^{n\times n}$. Boxed labels on the right-hand side are the leaf nodes of the tree generated by AtA-S, corresponding to computation tasks assigned to corresponding processes in the left-hand side leaf labels.
  • Figure 2: Multiplication with vertical/horizontal tiling.
  • Figure 3: AtA vs Intel MKL dsyrk
  • Figure 4: FastStrassen vs Intel MKL dgemm
  • Figure 5: Experimental results of AtA-S and Intel MKL dsyrk in terms of elapsed time in seconds (left column) and effective GFLOPs (right column), varying the number of available cores $P$ on fixed matrix sizes with a 16-thread configuration.
  • ...and 1 more figure

Theorems & Definitions (3)

  • Proposition 3.1
  • Proposition 4.1
  • Proposition 4.2