Minimum Cost Loop Nests for Contraction of a Sparse Tensor with a Tensor Network

Raghavendra Kanakagiri, Edgar Solomonik

TL;DR

This work tackles the bottleneck of contracting a sparse input tensor with a tensor network of dense factors by automatically discovering minimum-cost loop nests for SpTTN kernels. It introduces SpTTN-Cyclops, a runtime that (i) enumerates contraction paths and loop orders, (ii) uses dynamic programming with tree-separable cost functions to prune the search, and (iii) executes optimized loop nests in distributed environments while leveraging BLAS where possible. The key contributions are a formal framework for loop-nest trees/forests, tractable DP-based optimization with concrete cost metrics (max buffer size and cache misses), and a practical runtime that outperforms generalized libraries and matches specialized codes on real-world workloads. The framework enables scalable, automatic optimization of a broad class of sparse-tensor contraction kernels relevant to tensor decomposition and completion, with strong empirical performance and avenues for future extensions such as partial fusion and broader data-distribution strategies.

Abstract

Sparse tensor decomposition and completion are common in numerous applications, ranging from machine learning to computational quantum chemistry. Typically, the main bottleneck in optimizing these models is the contraction of a single large sparse tensor with a network of several dense matrices or tensors (SpTTN). Prior work on high-performance tensor decomposition and completion has focused on performance and scalability optimizations for specific SpTTN kernels. We present algorithms and a runtime system for identifying and executing the most efficient loop nest for any SpTTN kernel. We consider both enumeration of such loop nests for autotuning and efficient algorithms for finding the lowest-cost loop nest for simpler metrics, such as buffer size or cache miss models. Our runtime system identifies the best choice of loop nest without user guidance, and also provides a distributed-memory parallelization of SpTTN kernels. We evaluate our framework using both real-world and synthetic tensors. Our results demonstrate that our approach outperforms available generalized state-of-the-art libraries and matches the performance of specialized codes.
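
To make the notion of a "loop nest for an SpTTN kernel" concrete, the sketch below computes an order 3 MTTKRP, $A_{ir} = \sum_{j,k} \mathcal{T}_{ijk} B_{jr} C_{kr}$, with two different loop nests. It is a minimal illustration in plain Python, not the SpTTN-Cyclops implementation: the dictionary-based sparse format and the function names `mttkrp_unfactorized` and `mttkrp_factorized` are assumptions made for this example. The first version performs two multiplications per nonzero and per value of $r$; the second contracts $k$ first into a length-$R$ buffer, so it performs one multiplication per nonzero plus one per $(i,j)$ fiber for each $r$.

```python
from collections import defaultdict

def mttkrp_unfactorized(T, B, C, I, R):
    """A[i][r] = sum_{j,k} T[i,j,k] * B[j][r] * C[k][r] in a single loop nest."""
    A = [[0.0] * R for _ in range(I)]
    for (i, j, k), v in T.items():      # sparse loop over the nonzeros of T
        for r in range(R):              # dense loop
            A[i][r] += v * B[j][r] * C[k][r]
    return A

def mttkrp_factorized(T, B, C, I, R):
    """Same result, but contract k first into a length-R buffer per (i, j) fiber."""
    fibers = defaultdict(list)          # (i, j) -> list of (k, value)
    for (i, j, k), v in T.items():
        fibers[(i, j)].append((k, v))
    A = [[0.0] * R for _ in range(I)]
    for (i, j), entries in fibers.items():
        buf = [0.0] * R                 # intermediate buffer, indexed by r
        for k, v in entries:
            for r in range(R):
                buf[r] += v * C[k][r]
        for r in range(R):
            A[i][r] += buf[r] * B[j][r]
    return A

# Hypothetical toy data: a 2 x 3 x 2 sparse tensor with four nonzeros and R = 2.
T = {(0, 0, 0): 1.0, (0, 0, 1): 2.0, (1, 2, 1): 3.0, (1, 0, 0): -1.0}
B = [[1.0, 2.0], [0.5, 1.5], [2.0, -1.0]]
C = [[1.0, 0.0], [3.0, 1.0]]
A1 = mttkrp_unfactorized(T, B, C, I=2, R=2)
A2 = mttkrp_factorized(T, B, C, I=2, R=2)
```

Both functions return the same matrix on this input. Which loop nest is cheaper in practice depends on the sparsity pattern and on the cost metric, which is the kind of choice the abstract describes automating.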

Paper Structure

This paper contains 29 sections, 1 theorem, 8 equations, 7 figures, 1 table, 2 algorithms.

Key Result

Theorem 4.7

Consider a contraction path $(T,L)$ and a tree-separable cost function $f$ specified by $\varphi_{T,L}$ and $\oplus$. ORDER($T, L, \varphi_{T,L,r}$) (Algorithm alg:algo_recursive) returns two loop orders, $A$ and $B$, for $(T,L)$, so that $A$ has minimal cost ($f_{\varphi,\oplus}(T,L,A)$) among all

Figures (7)

  • Figure 1: Graphs illustrating loop nests for computing an order 3 TTMc kernel. Sparse loops are shown as dotted vertices.
  • Figure 2: An order 4 TTMc kernel $\mathcal{S}_{irst} = \mathcal{T}_{ijkl}\cdot \mathcal{U}_{jr}\cdot \mathcal{V}_{ks}\cdot \mathcal{W}_{lt}$, where (a) represents the contraction path tree ($T$) with $L = ((ijkl,lt,ijkt), (ijkt,ks,ijst), (ijst,jr,irst))$, and (b) shows the path graphs corresponding to the contraction path terms, fused to obtain a fully fused loop nest tree.
  • Figure 3: Loop nest for an order 4 TTMc kernel. Loop $r$ of contraction is not via recursion but is generated as a loop by metaprogramming. Contractions and are offloaded to BLAS-1, and contraction is offloaded to a BLAS-2 kernel.
  • Figure 4: Single thread performance of MTTKRP with $R=64$.
  • Figure 5: Strong scaling of kernels TTMc, MTTKRP and TTTP. The sparse tensor dimensions are identical across all modes. TTMc and MTTKRP are computed on order 3 and order 4 tensors of $0.1\%$ sparsity. Their dimensions are set to $8192$ and $1024$, respectively. TTTP is computed on order 3 tensors. $R=32$.
  • ...and 2 more figures
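
The contraction path given in the Figure 2 caption, $L = ((ijkl,lt,ijkt), (ijkt,ks,ijst), (ijst,jr,irst))$, can be read as three pairwise contractions. The sketch below evaluates them one term at a time in plain Python and checks the result against a direct four-operand sum. It is an unfused illustration under stated assumptions: each intermediate is fully materialized, whereas the figure describes fusing the terms into a single loop nest tree. The dictionary-based sparse format and the function names `ttmc_order4_path` and `ttmc_order4_direct` are hypothetical.

```python
from collections import defaultdict

def ttmc_order4_path(T, U, V, W):
    """S_irst = T_ijkl * U_jr * V_ks * W_lt, evaluated term by term along L."""
    n_r, n_s, n_t = len(U[0]), len(V[0]), len(W[0])
    # Term (ijkl, lt, ijkt): contract l. X is sparse in (i, j, k), dense in t.
    X = defaultdict(lambda: [0.0] * n_t)
    for (i, j, k, l), v in T.items():
        row = X[(i, j, k)]
        for t in range(n_t):
            row[t] += v * W[l][t]
    # Term (ijkt, ks, ijst): contract k. Y is sparse in (i, j), dense in (s, t).
    Y = defaultdict(lambda: [[0.0] * n_t for _ in range(n_s)])
    for (i, j, k), row in X.items():
        blk = Y[(i, j)]
        for s in range(n_s):
            for t in range(n_t):
                blk[s][t] += row[t] * V[k][s]
    # Term (ijst, jr, irst): contract j. Output is sparse in i, dense in (r, s, t).
    out = defaultdict(lambda: [[[0.0] * n_t for _ in range(n_s)] for _ in range(n_r)])
    for (i, j), blk in Y.items():
        o = out[i]
        for r in range(n_r):
            for s in range(n_s):
                for t in range(n_t):
                    o[r][s][t] += blk[s][t] * U[j][r]
    return dict(out)

def ttmc_order4_direct(T, U, V, W):
    """Reference: multiply all four operands inside one loop nest."""
    n_r, n_s, n_t = len(U[0]), len(V[0]), len(W[0])
    out = defaultdict(lambda: [[[0.0] * n_t for _ in range(n_s)] for _ in range(n_r)])
    for (i, j, k, l), v in T.items():
        o = out[i]
        for r in range(n_r):
            for s in range(n_s):
                for t in range(n_t):
                    o[r][s][t] += v * U[j][r] * V[k][s] * W[l][t]
    return dict(out)

# Hypothetical toy data with small integer values, so float sums are exact.
T = {(0, 0, 0, 0): 1.0, (0, 0, 0, 1): 2.0, (0, 1, 1, 0): 3.0, (1, 0, 1, 1): 4.0}
U = [[1.0, 2.0], [3.0, 4.0]]   # indexed [j][r]
V = [[1.0, 0.0], [2.0, 1.0]]   # indexed [k][s]
W = [[1.0, 1.0], [0.0, 2.0]]   # indexed [l][t]
S_path = ttmc_order4_path(T, U, V, W)
S_direct = ttmc_order4_direct(T, U, V, W)
```

Following the path, the innermost dense work per nonzero is a loop over $t$ only, while the direct version loops over $r$, $s$ and $t$ for every nonzero. The price is the storage for the intermediates `X` and `Y`, which is the buffer cost that fusing the loop nests aims to reduce.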

Theorems & Definitions (9)

  • Definition 3.1: Contraction Path
  • Definition 3.2: Loop Order
  • Definition 4.1: Peeling of Loop Order
  • Definition 4.2: Fully-fused Loop Nest Forest
  • Definition 4.3: Peeling of Fully-fused Loop Nest Tree
  • Definition 4.4: Tree-separable Cost Function
  • Definition 4.5: Cost Function for Maximum Buffer Dimension
  • Definition 4.6: Cost Function for Total Number of Cache Misses
  • Theorem 4.7: Proof of Correctness of Algorithm ORDER (alg:algo_recursive)