Arrow Matrix Decomposition: A Novel Approach for Communication-Efficient Sparse Matrix Multiplication

Lukas Gianinazzi, Alexandros Nikolaos Ziogas, Langwen Huang, Piotr Luczynski, Saleh Ashkboos, Florian Scheidl, Armon Carigiet, Chio Ge, Nabil Abubaker, Maciej Besta, Tal Ben-Nun, Torsten Hoefler

TL;DR

This work decomposes a sparse matrix into a small number of highly structured matrices, called arrow matrices, which are connected by permutations. The decomposition enables communication-avoiding multiplications and achieves a polynomial reduction in communication volume per iteration for matrices corresponding to planar graphs and other minor-excluded families of graphs.

Abstract

We propose a novel approach to iterated sparse matrix dense matrix multiplication, a fundamental computational kernel in scientific computing and graph neural network training. In cases where matrix sizes exceed the memory of a single compute node, data transfer becomes a bottleneck. An approach based on dense matrix multiplication algorithms leads to suboptimal scalability and fails to exploit the sparsity in the problem. To address these challenges, we propose decomposing the sparse matrix into a small number of highly structured matrices called arrow matrices, which are connected by permutations. Our approach enables communication-avoiding multiplications, achieving a polynomial reduction in communication volume per iteration for matrices corresponding to planar graphs and other minor-excluded families of graphs. Our evaluation demonstrates that our approach outperforms a state-of-the-art method for sparse matrix multiplication on matrices with hundreds of millions of rows, offering near-linear strong and weak scaling.
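
As a rough illustration of how a decomposition of this kind could be used, the sketch below applies a sum of permuted structured matrices to a dense matrix without assembling the sparse matrix itself. The form $\mathbf{A} = \sum_i \mathbf{P}_i^\top \mathbf{B}_i \mathbf{P}_i$, the `perm` convention, and the helper names `matmul`, `apply_decomposition`, and `assemble` are assumptions made for this example, not notation taken from the paper. It is single-process and dense, and it does not model the distributed communication scheme.

```python
# Hypothetical sketch: apply A = sum_i P_i^T B_i P_i to a dense X without forming A.
# Assumed convention: perm[r] is the original index stored at permuted position r.

def matmul(M, X):
    """Dense product of two matrices stored as lists of rows."""
    return [[sum(M[i][t] * X[t][j] for t in range(len(X)))
             for j in range(len(X[0]))] for i in range(len(M))]

def apply_decomposition(parts, X):
    """Compute sum_i P_i^T (B_i (P_i X)) for parts = [(perm_i, B_i), ...]."""
    n, k = len(X), len(X[0])
    Y = [[0] * k for _ in range(n)]
    for perm, B in parts:
        PX = [X[perm[r]] for r in range(n)]   # P_i X: gather rows into permuted order
        Z = matmul(B, PX)                     # multiply with the structured matrix
        for r in range(n):                    # P_i^T Z: scatter rows back and accumulate
            for j in range(k):
                Y[perm[r]][j] += Z[r][j]
    return Y

def assemble(parts, n):
    """Form A explicitly; only meant for checking the sketch on tiny inputs."""
    A = [[0] * n for _ in range(n)]
    for perm, B in parts:
        for r in range(n):
            for c in range(n):
                A[perm[r]][perm[c]] += B[r][c]
    return A

# Tiny made-up example: B0 has non-zeros in its first row, first column, and a
# narrow band; B1 carries two leftover entries under a different permutation.
B0 = [[2, 1, 1, 1],
      [1, 2, 1, 0],
      [1, 1, 2, 1],
      [1, 0, 1, 2]]
B1 = [[0, 3, 0, 0],
      [3, 0, 0, 0],
      [0, 0, 0, 0],
      [0, 0, 0, 0]]
parts = [([0, 1, 2, 3], B0), ([1, 3, 0, 2], B1)]
X = [[1, 0], [0, 1], [1, 1], [2, 0]]
Y = apply_decomposition(parts, X)
```

In an iterated setting, `Y` would be fed back in as the next `X`, so the permutations and structured matrices are reused across iterations.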

Paper Structure (43 sections, 11 theorems, 4 equations, 6 figures, 2 tables, 2 algorithms)

Key Result

Lemma 1

For $x = \frac{bm}{ \max_i \lambda_{\pi_i'}(G_i')}$, LA-Decompose($\mathbf{A}$, $b$) computes an $x$-compacting $b$-arrow matrix decomposition.

Figures (6)

  • Figure 1: Non-zero structure of the first matrix $\mathbf{B_0}$ in an arrow matrix decomposition for matrices from the SuiteSparse Matrix Collection. The color indicates the number of non-zeros per row; white blocks are empty. Each block has $5$ million rows.
  • Figure 2: In a distribution of an arrow matrix $\mathbf{B}$, each tile of $\mathbf{B}$ is $b\times b$ and each tile of $\mathbf{D}$ and $\mathbf{C}$ is $b\times k$. The numbers indicate the process ranks holding or contributing to the tile.
  • Figure 3: LA-Decompose produces a linear arrangement $\pi_0$ of the vertices of the graph that corresponds to the sparsity structure of the input matrix. This creates three parts in the matrix: (1) a flipped 'L' shape that contains the highest-degree vertices (in blue), (2) a band around the diagonal (in green), and (3) the remainder (in red). The first two parts form the first matrix $\mathbf{B_0}$ of the decomposition. The rest of the decomposition proceeds recursively on the remainder.
  • Figure 4: Weak scaling of the 1D / 1.5D baseline for varying replication factors $c$ on the MAWI datasets.
  • Figure 5: Strong scaling results for varying feature sizes.
  • ...and 1 more figure
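
The three-part split described in the Figure 3 caption can be sketched on a toy graph. This is only an approximation under stated assumptions: vertices are simply sorted by degree as a stand-in for the linear arrangement that LA-Decompose computes, the arrow region is taken to be the first `b` rows and columns plus the band `|i - j| <= b`, and the names `degree_order`, `permute`, and `split_arrow` are hypothetical.

```python
# Hypothetical sketch of the split in the Figure 3 caption.
# Assumption: a plain degree sort replaces the linear arrangement of LA-Decompose.

def degree_order(A):
    """Vertex order with the highest-degree rows (most non-zeros) first."""
    n = len(A)
    deg = [sum(1 for v in A[i] if v != 0) for i in range(n)]
    return sorted(range(n), key=lambda i: (-deg[i], i))

def permute(A, order):
    """Symmetrically reorder rows and columns of A according to order."""
    n = len(A)
    return [[A[order[r]][order[c]] for c in range(n)] for r in range(n)]

def split_arrow(A, b):
    """Split A into (B, R) with A = B + R.

    B keeps the first b rows and columns (the flipped 'L') and the band
    |i - j| <= b around the diagonal; R holds the remainder.
    """
    n = len(A)
    B = [[0] * n for _ in range(n)]
    R = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            in_arrow = i < b or j < b or abs(i - j) <= b
            (B if in_arrow else R)[i][j] = A[i][j]
    return B, R

# Toy graph: vertex 3 is a hub connected to every other vertex, plus a few extra edges.
n = 6
edges = [(3, 0), (3, 1), (3, 2), (3, 4), (3, 5), (0, 1), (1, 2), (4, 5), (0, 5)]
A = [[0] * n for _ in range(n)]
for u, v in edges:
    A[u][v] = A[v][u] = 1

order = degree_order(A)            # the hub moves to position 0
B0, remainder = split_arrow(permute(A, order), b=1)
```

A recursive decomposition would repeat the same steps on `remainder` with a fresh ordering, stopping once the remainder is empty.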

Theorems & Definitions (11)

  • Lemma 1
  • Lemma 2
  • Lemma 3
  • Corollary 1
  • Lemma 4
  • Theorem 1
  • Lemma 5
  • Corollary 2
  • Lemma 6
  • Theorem 2
  • ...and 1 more