SparseZipper: Enhancing Matrix Extensions to Accelerate SpGEMM on CPUs

Tuan Ta, Joshua Randall, Christopher Batten

TL;DR

SparseZipper targets the inefficiency of dense-oriented matrix extensions when performing sparse-sparse GEMM (SpGEMM) on CPUs with unstructured sparsity. It minimally extends an existing matrix ISA and a baseline 16×16 systolic array with merge-based stream sorting and merging instructions. These instructions operate on the key-value streams generated by Gustavson's row-wise SpGEMM algorithm and reuse the existing matrix and vector registers. The approach yields average speedups of 5.98× over a scalar hash-based SpGEMM and 2.61× over a state-of-the-art vectorized SpGEMM, while increasing the area of the baseline 16×16 systolic array by 12.7%, which amounts to an overhead of only a few percent for an entire SoC. These results indicate that SpGEMM can be accelerated efficiently on CPUs through targeted ISA and micro-architectural enhancements, benefiting workloads in graph analytics, scientific computing, and related domains.
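To make the key-value streams concrete, here is a minimal software sketch of row-wise (Gustavson) SpGEMM that accumulates each output row by sorting and merging, rather than hashing. It is only an illustration under assumed conventions: the function name `spgemm_row_wise` and the list-of-rows sparse layout are hypothetical and are not taken from the paper, and the sort and merge that SparseZipper performs in hardware are modeled here with ordinary Python calls.

```python
from itertools import groupby


def spgemm_row_wise(a_rows, b_rows):
    """Multiply two sparse matrices stored as lists of rows.

    Each row is a list of (column index, value) tuples sorted by column.
    This layout is an assumption made for the sketch.
    """
    c_rows = []
    for a_row in a_rows:
        # Every nonzero A[i, k] scales row k of B, which yields one
        # key-value stream (key = column index) per nonzero in the row.
        stream = [(j, a_val * b_val)
                  for k, a_val in a_row
                  for j, b_val in b_rows[k]]
        # Sort the concatenated streams by key ...
        stream.sort(key=lambda kv: kv[0])
        # ... then combine the values of tuples that share a key.
        merged = [(key, sum(val for _, val in group))
                  for key, group in groupby(stream, key=lambda kv: kv[0])]
        c_rows.append(merged)
    return c_rows


A = [[(0, 1.0), (2, 2.0)],
     [(1, 3.0)]]
B = [[(1, 4.0)],
     [(0, 5.0), (2, 6.0)],
     [(1, 7.0), (2, 8.0)]]
C = spgemm_row_wise(A, B)
```

In this example, row 0 of `C` combines two tuples with key 1 (one from each scaled row of `B`), which is the duplicate-key case that the merge step has to handle.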

Abstract

The importance of general matrix multiplication (GEMM) is motivating new instruction set extensions for multiplying dense matrices in almost all contemporary ISAs, and these extensions are often implemented using high-performance systolic arrays. However, matrices in emerging workloads are not always dense, and sparse matrices, in which the vast majority of values are zeros, are becoming more common. Existing matrix extensions and micro-architectures cannot efficiently process highly sparse matrices for two reasons: (1) wasted work when one or both input values are zero; and (2) incompatibility with sparse matrix formats. This work proposes SparseZipper, which minimally modifies existing matrix extensions and systolic-array-based micro-architectures specialized for dense-dense GEMM to accelerate sparse-sparse GEMM (SpGEMM) operating on highly sparse matrices with unstructured sparsity. Our performance evaluation shows SparseZipper achieves 5.98× and 2.61× speedup over a scalar hash-based implementation of SpGEMM and a state-of-the-art vectorized SpGEMM version, respectively. Our component-level area evaluation shows SparseZipper increases the area of a baseline 16×16 systolic array by only 12.7%, resulting in an area overhead for an entire system-on-chip of just a few percent.
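The first source of inefficiency, wasted work on zero operands, can be illustrated with a small count. The sketch below compares the number of multiply-accumulates a dense GEMM performs with the number whose two operands are both nonzero. The helper name `useful_vs_dense_macs` and the toy matrices are hypothetical and chosen only for illustration.

```python
def useful_vs_dense_macs(a, b):
    """Return (useful, dense) multiply-accumulate counts for a @ b.

    a and b are dense lists of lists; a product counts as useful only
    when both of its operands are nonzero.
    """
    n, m, p = len(a), len(b), len(b[0])
    dense = n * m * p  # a dense GEMM performs every product
    useful = sum(1
                 for i in range(n)
                 for k in range(m)
                 for j in range(p)
                 if a[i][k] != 0 and b[k][j] != 0)
    return useful, dense


a = [[1, 0, 2],
     [0, 3, 0]]
b = [[0, 4, 0],
     [5, 0, 6],
     [0, 7, 8]]
useful, dense = useful_vs_dense_macs(a, b)
```

Even for these tiny inputs, most of the dense products involve a zero operand, and the gap widens as the matrices become sparser.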

Paper Structure

This paper contains 23 sections, 11 figures, 4 tables.

Figures (11)

  • Figure 1: Multiple Steps to Compute One Output Row -- Each tuple includes a column index (key) and a value.
  • Figure 2: Merging Two Sorted Key-Value Partitions in Chunks -- e.g., merging the last two key-value partitions in Figure \ref{fig-spz-stream-merge-tree}.
  • Figure 3: Mapping Between Key-Value Streams and Matrix Registers -- Only chunks of key-value tuples with dashed borders are held in the two matrix registers.
  • Figure 4: Examples of Using SparseZipper Instructions to Sort and Merge Key-Value Streams -- a{0..3} = scalar registers; v{0..9} = vector registers; tr{0..3} = matrix registers.
  • Figure 5: Cycle-by-Cycle Systolic Execution of mssortk in a 3×3 Systolic Array for Two Unsorted Lists of Keys -- PE states: F = forward, X = switch, C = combine; W_IC = west input counter; N_IC = north input counter; E_OC = east output counter; S_OC = south output counter; d = duplicate key that is excluded; x = unmergeable key. Counters in red indicate they are being updated. PEs in gray are inactive. PEs in blue are merging keys. PEs in purple are compressing valid output keys. Keys in red come from the north input. Keys in green come from the west input. Keys on the west and east sides are ordered from bottom to top. Keys on the north and south sides are ordered from left to right.
  • ...and 6 more figures
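
The chunked merging named in the Figure 2 caption can be sketched in software as follows. This is a rough illustration only and does not reproduce the semantics of the SparseZipper instructions: the function `merge_in_chunks`, the `chunk` parameter, and the rule of emitting only keys up to the smaller of the two chunk maxima are assumptions made for this sketch. It also assumes each partition is sorted by key with no repeated keys inside a partition.

```python
def _append_combining(out, key, val):
    # Combine with the previous tuple when the key repeats.
    if out and out[-1][0] == key:
        out[-1] = (key, out[-1][1] + val)
    else:
        out.append((key, val))


def merge_in_chunks(left, right, chunk=4):
    """Merge two key-sorted lists of (key, value) tuples chunk by chunk."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lc = left[i:i + chunk]
        rc = right[j:j + chunk]
        # Keys above this bound might still be preceded by keys not yet
        # loaded from the other partition, so they wait for a later step.
        bound = min(lc[-1][0], rc[-1][0])
        l_safe = [kv for kv in lc if kv[0] <= bound]
        r_safe = [kv for kv in rc if kv[0] <= bound]
        i += len(l_safe)
        j += len(r_safe)
        for key, val in sorted(l_safe + r_safe, key=lambda kv: kv[0]):
            _append_combining(out, key, val)
    # At most one partition still has tuples left; it is already sorted.
    for key, val in left[i:] + right[j:]:
        _append_combining(out, key, val)
    return out


left = [(1, 1.0), (3, 2.0), (5, 3.0), (7, 4.0), (9, 5.0)]
right = [(2, 10.0), (3, 20.0), (8, 30.0)]
merged = merge_in_chunks(left, right, chunk=2)
```

Each step fully consumes the chunk with the smaller maximum key, so the loop always makes progress, and tuples sharing key 3 are combined into a single output tuple.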