Diagonally-Addressed Matrix Nicknack: How to improve SpMV performance

Jens Saak, Jonas Schulze

TL;DR

This paper tackles the memory bandwidth bottleneck in SpMV by introducing Diagonally-Addressed (DA) storage and applying it to the CSR format as DA-CSR. DA-CSR stores the column indices of nonzeros as offsets from the diagonal, which allows smaller index types and reduces matrix-related traffic without losing information. The evaluation covers 1367 matrices from the SuiteSparse Matrix Collection, more than 95% of which fit into DA-CSR with 16-bit column indices, potentially after RCM reordering. With double-precision scalars, the average uplift is 11% (single-threaded) or 17.5% (multithreaded) when the traffic exceeds the last-level cache size; for traffic within the combined L2 and L3 caches, the multithreaded uplift exceeds 40% for a few test matrices. The work demonstrates practical improvements for memory-bound sparse linear algebra workloads and highlights potential benefits for multi-precision SpMV, with publicly available code and data to facilitate adoption and further research.

Abstract

We suggest a technique to reduce the storage size of sparse matrices at no loss of information. We call this technique Diagonally-Addressed (DA) storage. It exploits the typically low matrix bandwidth of matrices arising in applications. For memory-bound algorithms, this traffic reduction has direct benefits for both uni-precision and multi-precision algorithms. In particular, we demonstrate how to apply DA storage to the Compressed Sparse Rows (CSR) format and compare the performance in computing the Sparse Matrix-Vector (SpMV) product, which is a basic building block of many iterative algorithms. We investigate 1367 matrices from the SuiteSparse Matrix Collection fitting into the CSR format using signed 32-bit indices. More than 95% of these matrices fit into the DA-CSR format using 16-bit column indices, potentially after Reverse Cuthill-McKee (RCM) reordering. Using IEEE 754 double-precision scalars, we observe a performance uplift of 11% (single-threaded) or 17.5% (multithreaded) on average when the traffic exceeds the size of the last-level CPU cache. The predicted uplift in this scenario is 20%. For traffic within the CPU's combined level 2 and level 3 caches, the multithreaded performance uplift is over 40% for a few test matrices.
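
To make the idea of diagonal addressing concrete, the following is a minimal sketch in pure Python, not the authors' implementation. It assumes that each column index j in row i is stored as the signed offset j - i, held in a signed 16-bit array; the function names `csr_to_da_csr` and `spmv_da_csr` are hypothetical, and the paper's actual encoding of the offsets may differ.

```python
from array import array


def csr_to_da_csr(row_ptr, col_idx):
    """Convert CSR column indices to diagonal-relative offsets (j - i).

    Assumption: offsets are stored as signed 16-bit integers. Appending to
    array("h") raises OverflowError if an offset does not fit, i.e. if the
    matrix bandwidth is too large for this index type.
    """
    offsets = array("h")
    for i in range(len(row_ptr) - 1):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            offsets.append(col_idx[k] - i)
    return offsets


def spmv_da_csr(row_ptr, offsets, values, x):
    """Compute y = A @ x for A stored in the DA-CSR layout sketched above."""
    y = [0.0] * (len(row_ptr) - 1)
    for i in range(len(y)):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            # The column index is recovered as row index plus stored offset.
            acc += values[k] * x[i + offsets[k]]
        y[i] = acc
    return y


# Tridiagonal 4x4 example: 2 on the diagonal, -1 on both off-diagonals.
row_ptr = [0, 2, 5, 8, 10]
col_idx = [0, 1, 0, 1, 2, 1, 2, 3, 2, 3]
values = [2.0, -1.0, -1.0, 2.0, -1.0, -1.0, 2.0, -1.0, -1.0, 2.0]

offsets = csr_to_da_csr(row_ptr, col_idx)
y = spmv_da_csr(row_ptr, offsets, values, [1.0, 1.0, 1.0, 1.0])
```

In this sketch the row pointers and scalar values are unchanged relative to CSR; only the column index array shrinks. The stored offsets are bounded by the matrix bandwidth rather than by the number of columns, which is why a bandwidth-reducing reordering such as RCM can make more matrices fit a 16-bit index type.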

Paper Structure (7 sections, 8 equations, 4 figures, 2 tables)

This paper contains 7 sections, 8 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Sample matrix (left) in CSR storage (middle) and DA-CSR storage (right). Colorful dots represent non-zero entries, gray dots are zero. Whiskers represent (column) indices with respect to a reference line (dashed).
  • Figure 2: Sparsity patterns of the matrices GHS_psdef/ldoor (left) as well as Janna/Bump_2911 (right) from the SuiteSparse Matrix Collection.
  • Figure 3: Approximate matrix-related traffic reduction when exchanging the data types used to store the matrix scalars (horizontally) or column indices (vertically) of (DA-) CSR storage, as well as dense storage (no indices required).
  • Figure 4: Relative performance and throughput of SpMV using the DA-CSR format with 16-bit column indices w.r.t. CSR using 32-bit column indices as the baseline (iso-scalar). The sizes of the L1d, L2, and L3 CPU caches are marked with vertical lines (left to right).
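
The traffic comparison behind Figures 3 and 4 can be approximated with a back-of-the-envelope estimate. The sketch below assumes that matrix-related traffic is dominated by one scalar and one column index per nonzero, ignoring row pointers and vector traffic, and that a memory-bound SpMV runs in time proportional to that traffic. Under these assumptions the result is consistent with the predicted uplift of 20% quoted in the abstract; the helper name `bytes_per_nonzero` is hypothetical.

```python
def bytes_per_nonzero(scalar_bytes, index_bytes):
    """Approximate matrix-related traffic per stored nonzero.

    Assumption: row pointers and the input/output vectors are ignored.
    """
    return scalar_bytes + index_bytes


# IEEE 754 double precision scalars occupy 8 bytes each.
csr_traffic = bytes_per_nonzero(8, 4)      # CSR with 32-bit column indices
da_csr_traffic = bytes_per_nonzero(8, 2)   # DA-CSR with 16-bit column indices

# For a memory-bound kernel, the predicted speedup is the traffic ratio.
predicted_speedup = csr_traffic / da_csr_traffic
predicted_uplift_percent = round((predicted_speedup - 1.0) * 100.0, 6)
```

The same helper shows why the relative benefit grows with smaller scalars: with 4-byte single-precision values, the ratio becomes 8 / 6 instead of 12 / 10.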

Theorems & Definitions (5)

  • Example 1
  • Example 2
  • Example 3
  • Remark 4: Non-Square Matrices
  • Remark 5: CSR using 16-bit column indices