SABLE: Staging Blocked Evaluation of Sparse Matrix Computations
Pratyush Das, Amirhossein Basareh, Adhitha Dias, Artem Pelenitsyn, Kirshanthan Sundararajah, Milind Kulkarni
TL;DR
SABLE targets SpMV performance on matrices with structured sparsity by moving beyond purely dense or purely sparse representations. It introduces a staging-based inspector-executor framework that partitions CSR matrices into a Variable Block Row (VBR) representation and a hybrid VBR-C format. It then generates region-specific code for high $\delta$-dense blocks and dispatches low $\delta$-dense blocks to a baseline library. The work contributes a novel partitioner, a static classifier that identifies matrices likely to benefit from this approach, and a multi-stage staging compiler that produces specialized C code for the dense regions. On real-world SuiteSparse matrices, SABLE achieves geometric mean speedups of $1.07\times$, $2.73\times$, and $1.9\times$ over Intel MKL, CSR5, and Partially-Strided Codelets, respectively, in single-threaded runs, with larger gains under parallel execution. Overall, SABLE shows that exploiting dense substructures within sparse matrices and compiling region-specific code for them can improve SpMV performance while a hybrid storage scheme preserves flexibility.
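The split between generated code and a baseline library can be illustrated with a minimal sketch. It assumes that a block's density is its non-zero count divided by its area and that a block counts as dense when this ratio is at least a threshold `delta`; the source does not spell out this definition, so treat it as an assumption. The names `block_density`, `classify_blocks`, and `hybrid_spmv` are hypothetical, the matrix is held as a plain list of lists for readability, and the zero-skipping loop merely stands in for a call to a sparse library.

```python
from typing import List, Tuple

# (row_start, row_end, col_start, col_end), end indices exclusive
Block = Tuple[int, int, int, int]


def block_density(a: List[List[float]], block: Block) -> float:
    """Fraction of entries inside the block that are non-zero (assumed definition)."""
    r0, r1, c0, c1 = block
    nnz = sum(1 for i in range(r0, r1) for j in range(c0, c1) if a[i][j] != 0.0)
    return nnz / ((r1 - r0) * (c1 - c0))


def classify_blocks(a, blocks, delta):
    """Split blocks into those at or above the density threshold and the rest."""
    high, low = [], []
    for b in blocks:
        (high if block_density(a, b) >= delta else low).append(b)
    return high, low


def hybrid_spmv(a, blocks, delta, x):
    """Compute y = A x, treating dense and sparse blocks differently."""
    high, low = classify_blocks(a, blocks, delta)
    y = [0.0] * len(a)
    # Dense path: visit every entry, zeros included, so the loops stay regular.
    for r0, r1, c0, c1 in high:
        for i in range(r0, r1):
            for j in range(c0, c1):
                y[i] += a[i][j] * x[j]
    # Sparse path: skip zeros; a stand-in for dispatching to a sparse library.
    for r0, r1, c0, c1 in low:
        for i in range(r0, r1):
            for j in range(c0, c1):
                if a[i][j] != 0.0:
                    y[i] += a[i][j] * x[j]
    return y
```

In this sketch a fully populated 2x2 block would take the dense path for any `delta` up to 1, while a block with a single non-zero in four entries would take the sparse path for `delta = 0.5`.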
Abstract
Structured sparsity, such as regions of non-zero elements in sparse matrices, offers optimization opportunities that are often overlooked by existing solutions, which treat matrices as either entirely dense or entirely sparse. Block-based approaches such as BCSR partially address this issue, but they must choose among fixed block sizes, which results in wasted computation on zero elements. Variable-sized blocks avoid that waste but introduce overheads because their loop bounds are unknown at compile time. We present SABLE, a novel staging framework that achieves the best of both approaches by generating region-specific code tailored to variable-sized blocks. SABLE partitions the matrix to identify profitable blocks and specializes the generated code for vectorization. We evaluate SABLE on the SpMV kernel using the SuiteSparse collection. In single-threaded runs, SABLE achieves geometric mean speedups of $1.07\times$, $2.73\times$, and $1.9\times$ over the state-of-the-art systems Intel MKL, CSR5, and Partially-Strided Codelets, respectively, and larger speedups when parallelized.
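To make the idea of region-specific code concrete, here is a minimal staging sketch, not SABLE's actual generator. It emits C source for one block with every loop bound written as a literal, so a C compiler sees fixed trip counts instead of bounds loaded from index arrays at run time. The function name `emit_dense_block_kernel`, its parameters, and the row-major layout of block values starting at `val_offset` are assumptions made for illustration.

```python
def emit_dense_block_kernel(name: str, r0: int, r1: int, c0: int, c1: int,
                            val_offset: int) -> str:
    """Return C source computing y[r0:r1] += B * x[c0:c1] for one dense block B.

    All bounds and offsets are baked into the text as constants. The block's
    values are assumed to be stored row-major starting at val[val_offset].
    """
    cols = c1 - c0  # row stride of the block inside the value array
    return "\n".join([
        f"static void {name}(const double *val, const double *x, double *y) {{",
        f"    for (int i = {r0}; i < {r1}; i++) {{",
        "        double acc = 0.0;",
        f"        for (int j = {c0}; j < {c1}; j++) {{",
        f"            acc += val[{val_offset} + (i - {r0}) * {cols} + (j - {c0})] * x[j];",
        "        }",
        "        y[i] += acc;",
        "    }",
        "}",
    ])


if __name__ == "__main__":
    # Emit a kernel for a 2x3 block covering rows 0..1 and columns 4..6.
    print(emit_dense_block_kernel("spmv_block_0", 0, 2, 4, 7, 10))
```

A generic variable-block kernel would instead read `r0`, `r1`, `c0`, and `c1` from memory for each block; emitting one function per block trades code size for bounds the compiler can unroll and vectorize against.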
