Compilation of Modular and General Sparse Workspaces
Genghan Zhang, Olivia Hsu, Fredrik Kjolstad
TL;DR
This work addresses the sparse scattering bottleneck in sparse tensor algebra by introducing sparse workspaces as efficient adapters between compute code and sparse results. It proposes a modular, template-based Insert-Sort-Merge (ISM) framework that can express and instantiate a wide range of sparse workspace policies, integrated into the TACO compiler to generate sequential code competitive with hand-optimized libraries. The key contributions include an analysis framework for sparse scattering, the ISM template, automatic workspace insertion, and concrete policy implementations (data structures, sorting, and optimizations) that yield up to $27.12\times$ speedups over dense workspaces and up to $3.6\times$ memory footprint reductions on average. The results demonstrate that sparse workspaces offer substantial performance and memory advantages for higher-order tensor computations, while dense workspaces remain favorable in some cases; the approach provides a scalable path toward parallel sparse-workspace code generation and broad applicability across tensor expressions.
Abstract
Recent years have seen considerable work on compiling sparse tensor algebra expressions. This paper addresses a shortcoming in that work, namely how to generate efficient code (in time and space) that scatters values into a sparse result tensor. We address this shortcoming through a compiler design that generates code that uses sparse intermediate tensors (sparse workspaces) as efficient adapters between compute code that scatters and result tensors that do not support random insertion. Our compiler automatically detects sparse scattering behavior in tensor expressions and inserts necessary intermediate workspace tensors. We present an algorithm template for workspace insertion that is the backbone of our code generation algorithm. Our algorithm template is modular by design, supporting sparse workspaces that span multiple user-defined implementations. Our evaluation shows that sparse workspaces can be up to 27.12$\times$ faster than the dense workspaces of prior work. On the other hand, dense workspaces can be up to 7.58$\times$ faster than the sparse workspaces generated by our compiler in other situations, which motivates our compiler design that supports both. Our compiler produces sequential code that is competitive with hand-optimized linear and tensor algebra libraries on the expressions they support, but that generalizes to any other expression. Sparse workspaces are also more memory efficient than dense workspaces as they compress away zeros. This compression can asymptotically decrease memory usage, enabling tensor computations on data that would otherwise run out of memory.
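To make the scattering problem concrete, the following is a minimal hand-written sketch, not the code our compiler generates. It computes a sparse matrix product C = A B row by row on CSR inputs, a standard case where partial products arrive at result columns in no particular order while the CSR result only supports appending. The sketch assumes a deliberately simple workspace policy (an unordered coordinate list that is sorted and then merged), loosely following the insert, sort, and merge phases; the function name `spgemm_csr` and the array names `pos`, `crd`, and `val` are illustrative assumptions.

```python
def spgemm_csr(a_pos, a_crd, a_val, b_pos, b_crd, b_val):
    """Compute C = A @ B for CSR inputs, one result row at a time.

    Each matrix is given as (pos, crd, val): row pointers, column
    coordinates, and values. Returns the result in the same layout.
    """
    c_pos, c_crd, c_val = [0], [], []
    for i in range(len(a_pos) - 1):
        # Insert: scatter partial products into an unordered sparse
        # workspace. Its size is bounded by the work done for this row,
        # not by the number of columns as in a dense workspace.
        ws = []
        for pa in range(a_pos[i], a_pos[i + 1]):
            k = a_crd[pa]
            for pb in range(b_pos[k], b_pos[k + 1]):
                ws.append((b_crd[pb], a_val[pa] * b_val[pb]))

        # Sort: order the workspace entries by column coordinate.
        ws.sort(key=lambda entry: entry[0])

        # Merge: sum entries with equal coordinates and append them to
        # the result, which only supports in-order appends.
        row_start = len(c_crd)
        for j, v in ws:
            if len(c_crd) > row_start and c_crd[-1] == j:
                c_val[-1] += v
            else:
                c_crd.append(j)
                c_val.append(v)
        c_pos.append(len(c_crd))
    return c_pos, c_crd, c_val


# A = [[1, 0, 2], [0, 3, 0]] and B = [[0, 4], [5, 0], [6, 7]] in CSR form.
A = ([0, 2, 3], [0, 2, 1], [1, 2, 3])
B = ([0, 1, 2, 4], [1, 0, 0, 1], [4, 5, 6, 7])
C = spgemm_csr(*A, *B)  # CSR form of [[12, 18], [15, 0]]
```

A dense workspace would replace the list `ws` with an array of length equal to the number of columns of B. That avoids the sort, but it costs memory proportional to the dimension rather than to the number of nonzeros, which is the time and space trade-off described above.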
