Compilation of Modular and General Sparse Workspaces

Genghan Zhang, Olivia Hsu, Fredrik Kjolstad

TL;DR

This work addresses the sparse scattering bottleneck in sparse tensor algebra by introducing sparse workspaces as efficient adapters between compute code and sparse results. It proposes a modular, template-based Insert-Sort-Merge (ISM) framework that can express and instantiate a wide range of sparse workspace policies, integrated into the TACO compiler to generate sequential code competitive with hand-optimized libraries. The key contributions include an analysis framework for sparse scattering, the ISM template, automatic workspace insertion, and concrete policy implementations (data structures, sorting, and optimizations) that yield up to $27.12\times$ speedups over dense workspaces and up to $3.6\times$ memory footprint reductions on average. The results demonstrate that sparse workspaces offer substantial performance and memory advantages for higher-order tensor computations, while dense workspaces remain favorable in some cases; the approach provides a scalable path toward parallel sparse-workspace code generation and broad applicability across tensor expressions.

Abstract

Recent years have seen considerable work on compiling sparse tensor algebra expressions. This paper addresses a shortcoming in that work, namely how to generate efficient code (in time and space) that scatters values into a sparse result tensor. We address this shortcoming through a compiler design that generates code that uses sparse intermediate tensors (sparse workspaces) as efficient adapters between compute code that scatters and result tensors that do not support random insertion. Our compiler automatically detects sparse scattering behavior in tensor expressions and inserts necessary intermediate workspace tensors. We present an algorithm template for workspace insertion that is the backbone of our code generation algorithm. Our algorithm template is modular by design, supporting sparse workspaces that span multiple user-defined implementations. Our evaluation shows that sparse workspaces can be up to 27.12$\times$ faster than the dense workspaces of prior work. On the other hand, dense workspaces can be up to 7.58$\times$ faster than the sparse workspaces generated by our compiler in other situations, which motivates our compiler design that supports both. Our compiler produces sequential code that is competitive with hand-optimized linear and tensor algebra libraries on the expressions they support, but that generalizes to any other expression. Sparse workspaces are also more memory efficient than dense workspaces as they compress away zeros. This compression can asymptotically decrease memory usage, enabling tensor computations on data that would otherwise run out of memory.
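
To make the idea of a sparse workspace as an adapter more concrete, the following is a minimal Python sketch of an insert, sort, merge sequence for outer-product matrix multiplication. It is not the code the compiler generates: the coordinate-list workspace, the comparison sort, the function name `spgemm_outer_ism`, and the input layout (`B_cols`, `C_rows`) are all assumptions chosen for illustration, and they represent only one possible workspace policy.

```python
def spgemm_outer_ism(B_cols, C_rows, n_rows):
    """Compute A = B @ C with an outer-product loop order.

    Hypothetical input layout (an assumption for this sketch):
      B_cols[k] is a list of (i, value) pairs, the nonzeros of column k of B.
      C_rows[k] is a list of (j, value) pairs, the nonzeros of row k of C.
    Returns A in a CSR-like form: (pos, crd, vals).
    """
    # Insert: the compute loops scatter components in arbitrary order,
    # so they are appended to an unordered coordinate-list workspace.
    workspace = []
    for k in range(len(B_cols)):
        for i, b in B_cols[k]:
            for j, c in C_rows[k]:
                workspace.append((i, j, b * c))

    # Sort: order the workspace by (i, j) so that the result can be
    # assembled by appending only.
    workspace.sort(key=lambda t: (t[0], t[1]))

    # Merge: sum duplicates and append to the compressed result,
    # which does not support random insertion.
    pos = [0] * (n_rows + 1)
    crd, vals = [], []
    prev = None
    for i, j, v in workspace:
        if prev == (i, j):
            vals[-1] += v          # duplicate coordinate: reduce in place
        else:
            crd.append(j)
            vals.append(v)
            pos[i + 1] += 1        # count nonzeros per row
            prev = (i, j)
    for i in range(n_rows):        # prefix sum turns counts into offsets
        pos[i + 1] += pos[i]
    return pos, crd, vals


# B = [[1, 0], [2, 3]] and C = [[4, 5], [0, 6]], so A = [[4, 5], [8, 28]].
B_cols = [[(0, 1), (1, 2)], [(1, 3)]]
C_rows = [[(0, 4), (1, 5)], [(1, 6)]]
print(spgemm_outer_ism(B_cols, C_rows, n_rows=2))
# prints ([0, 2, 4], [0, 1, 0, 1], [4, 5, 8, 28])
```

In this sketch the workspace grows with the number of generated components rather than with the dense dimensions of the result, which is the memory argument the abstract makes for sparse workspaces. A dense workspace would instead skip the sort at the cost of allocating storage proportional to the full result dimensions.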

Paper Structure (45 sections, 5 equations, 28 figures, 5 tables, 1 algorithm)

This paper contains 45 sections, 5 equations, 28 figures, 5 tables, 1 algorithm.

Figures (28)

  • Figure 1: A second-order dense workspace for outer-product matrix multiplication (SpGEMM). The for-loop pseudocode above shows the sparse iterations that generate tensor components. Red numbers represent newly generated coordinates and values. The workspace must support three behaviors: deduplicating (a $\rightarrow$ b), appending (b $\rightarrow$ c), and inserting (c $\rightarrow$ d). The computation uses a workspace because the final compressed data structures do not support insertion. Furthermore, the result storage should be compressed for memory efficiency, since the final output has only four values.
  • Figure 2: A second-order sparse workspace for the outer-product SpGEMM in Figure 1. The colored nonzero components of the input tensors show a correspondence to their respective input tensor components. In a second-order workspace, each (I,J,val) tensor component is indexed by two variables I and J.
  • Figure 3: A simplified concrete index notation (CIN) syntax with no scheduling relationships.
  • Figure 4: Two example tensor level formats for compressed sparse row (CSR) and compressed sparse column (CSC).
  • Figure 5: Sparse tensor algebra expressions classified by computation and ordering. Blue and green arrows show the loop order and red lines show the result assembly order. Tensors' index variables encode access order.
  • ...and 23 more figures
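
As an illustration of the compressed formats named in the Figure 4 caption, the following Python sketch builds CSR and CSC representations of a small matrix. The helper name `compress` and the array names `pos`, `crd`, and `vals` are assumptions made for this sketch, not definitions taken from the paper.

```python
def compress(dense, by_rows=True):
    """Compress a dense 2-D list into (pos, crd, vals).

    by_rows=True stores nonzeros row by row (CSR);
    by_rows=False stores them column by column (CSC).
    """
    n_rows, n_cols = len(dense), len(dense[0])
    outer, inner = (n_rows, n_cols) if by_rows else (n_cols, n_rows)
    pos, crd, vals = [0], [], []
    for o in range(outer):
        for i in range(inner):
            v = dense[o][i] if by_rows else dense[i][o]
            if v != 0:
                crd.append(i)      # coordinate in the compressed (inner) level
                vals.append(v)
        pos.append(len(crd))       # end offset of this row or column
    return pos, crd, vals


A = [[4, 0, 5],
     [0, 0, 0],
     [0, 6, 0]]
print(compress(A, by_rows=True))   # prints ([0, 2, 2, 3], [0, 2, 1], [4, 5, 6])
print(compress(A, by_rows=False))  # prints ([0, 1, 2, 3], [0, 2, 0], [4, 6, 5])
```

Because `crd` and `vals` are packed contiguously, adding a nonzero in the middle of a row would require shifting every later entry and updating the later `pos` offsets. This is why such formats are assembled by appending in order, and why scattered results pass through a workspace first.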