SparseAuto: An Auto-Scheduler for Sparse Tensor Computations Using Recursive Loop Nest Restructuring

Adhitha Dias, Logan Anderson, Kirshanthan Sundararajah, Artem Pelenitsyn, Milind Kulkarni

TL;DR

SparseAuto tackles the challenging problem of auto-scheduling sparse tensor contractions by introducing a recursive loop-nest restructuring framework that spans Linear Iteration Graphs (LIG) and Multi-level Branched Iteration Graphs (BIG). It combines a poset-based pruning strategy with SMT solver analysis to efficiently navigate the vast schedule space, balancing time and auxiliary memory; this yields schedules that outperform traditional pipelines like TACO on real-world tensors. The approach is validated through extensive kernels and datasets, showing orders-of-magnitude speedups in several cases, while also highlighting tradeoffs with memory usage and transpositions. The work advances sparse tensor compiler design by enabling multi-level branching and automated schedule selection, with potential for scheduling-as-a-service deployments in scientific workloads.

Abstract

Automated code generation and performance enhancements for sparse tensor algebra have become essential in many real-world applications, such as quantum computing, physical simulations, computational chemistry, and machine learning. General sparse tensor algebra compilers are not always versatile enough to generate asymptotically optimal code for sparse tensor contractions. This paper shows how to generate asymptotically better schedules for complex sparse tensor expressions using kernel fission and fusion. We present generalized loop restructuring transformations to reduce asymptotic time complexity and memory footprint. Furthermore, we present an auto-scheduler whose partially ordered set (poset)-based cost model uses both time and auxiliary memory complexities to prune the search space of schedules. In addition, we highlight the use of Satisfiability Modulo Theories (SMT) solvers in sparse auto-schedulers to approximate the Pareto frontier of better schedules with the smallest possible number of schedules, given user-defined constraints available at compile time. Finally, we show that our auto-scheduler can select better-performing schedules and generate code for them. Our results show that the schedules provided by the auto-scheduler achieve orders-of-magnitude speedups over the code generated by the Tensor Algebra Compiler (TACO) for several computations on different real-world tensors.
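
To make the idea of pruning by time and auxiliary memory concrete, the following is a minimal sketch of Pareto-style filtering over a partial order. It is not the paper's implementation: the schedule names and the integer costs are hypothetical stand-ins chosen for illustration, whereas the abstract describes comparing complexities (with SMT solver support) rather than fixed numbers.

```python
from typing import List, NamedTuple


class Schedule(NamedTuple):
    # All fields are illustrative assumptions, not values from the paper.
    name: str
    time: int    # stand-in for an asymptotic time cost
    memory: int  # stand-in for an auxiliary memory cost


def dominates(a: Schedule, b: Schedule) -> bool:
    """Return True if `a` is no worse than `b` on both axes and strictly better on one."""
    no_worse = a.time <= b.time and a.memory <= b.memory
    strictly_better = a.time < b.time or a.memory < b.memory
    return no_worse and strictly_better


def pareto_frontier(schedules: List[Schedule]) -> List[Schedule]:
    """Keep every schedule that no other schedule dominates."""
    return [
        s for s in schedules
        if not any(dominates(other, s) for other in schedules)
    ]


# Hypothetical candidates: less time usually costs more auxiliary memory.
candidates = [
    Schedule("unfused", time=1000, memory=0),
    Schedule("fused_small_temp", time=100, memory=10),
    Schedule("fused_large_temp", time=100, memory=50),
    Schedule("fissioned", time=40, memory=200),
    Schedule("wasteful", time=1200, memory=30),
]

frontier = pareto_frontier(candidates)
print([s.name for s in frontier])
# prints ['unfused', 'fused_small_temp', 'fissioned']
```

Two schedules such as `unfused` and `fissioned` are incomparable under this order (one is better on memory, the other on time), which is why a partial order rather than a single scalar cost is a natural fit: only schedules that lose on both axes are discarded.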

Paper Structure (36 sections, 6 equations, 15 figures, 7 tables, 1 algorithm)

This paper contains 36 sections, 6 equations, 15 figures, 7 tables, 1 algorithm.

Figures (15)

  • Figure 1: An example of an iteration graph for sparse matrix-matrix multiplication and corresponding code.
  • Figure 2: Different schedules for executing $A_{lmn} = \sum\nolimits_{ijk} B_{ijk}\cdot C_{il}\cdot D_{jm}\cdot E_{kn}$. Here, code snippet (a) has a perfectly nested loop structure, while all the other code snippets have an imperfectly nested loop structure. $j\_pos$ refers to the non-affine loop associated with the index $j$. The loop $j\_pos$ is non-affine because $B_{ijk}$ is sparse. Code snippets (b) and (e) have one level of branching, whereas code snippets (c) and (d) have a branch nesting depth of two.
  • Figure 3: Placement of schedules based on asymptotic time vs. auxiliary memory complexities.
  • Figure 4: loopfuse transformation performed on $A_{lmn} = \sum\nolimits_{ijk} B_{ijk} C_{il} D_{jm} E_{kn}$, where $B$ is sparse. (a) TACO default kernel, (b) Fused kernel with $K$ extra memory, (c) Fused kernel with $JK$ extra memory, (d) Kernel (c) with reordered consumer branch, and (e) Multi-level nesting after fusing the inner branch of (d).
  • Figure 5: Transformation on the loop contraction
  • ...and 10 more figures