Accelerating Sparse Tensor Decomposition Using Adaptive Linearized Representation

Jan Laukemann, Ahmed E. Helal, S. Isaac Geronimo Anderson, Fabio Checconi, Yongseok Soh, Jesmin Jahan Tithi, Teresa Ranadive, Brian J Gravelle, Fabrizio Petrini, Jee Choi

TL;DR

This work tackles the challenge of efficiently analyzing high-dimensional sparse tensors with irregular shapes by introducing Adaptive Linearized Tensor Order (ALTO), a mode-agnostic, compact encoding that needs only a single tensor copy for all tensor decomposition (TD) operations. It presents ALTO-based parallel algorithms for CP-ALS and CP-APR that adaptively balance workload, manage memory, and resolve conflicts to maximize data reuse and minimize synchronization. Empirical results on Intel ICX/Sapphire Rapids systems show substantial speedups over state-of-the-art mode-agnostic, mode-specific, and memoization-based formats, along with significant storage reductions. The combination of adaptive traversal, memory management, and a unified linearized representation yields scalable performance across diverse sparsity patterns and tensor sizes, with practical implications for large-scale sparse tensor analytics.

Abstract

High-dimensional sparse data emerge in many critical application domains such as healthcare and cybersecurity. To extract meaningful insights from massive volumes of these multi-dimensional data, scientists employ unsupervised analysis tools based on tensor decomposition (TD) methods. However, real-world sparse tensors exhibit highly irregular shapes and data distributions, which pose significant challenges for making efficient use of modern parallel processors. This study breaks the prevailing assumption that compressing sparse tensors into coarse-grained structures or along a particular dimension/mode is more efficient than keeping them in a fine-grained, mode-agnostic form. Our novel sparse tensor representation, Adaptive Linearized Tensor Order (ALTO), encodes tensors in a compact format that can be easily streamed from memory and is amenable to both caching and parallel execution. In contrast to existing compressed tensor formats, ALTO constructs one tensor copy that is agnostic to both the mode orientation and the irregular distribution of nonzero elements. To demonstrate the efficacy of ALTO, we propose a set of parallel TD algorithms that exploit the inherent data reuse of tensor computations to substantially reduce synchronization overhead, decrease memory footprint, and improve parallel performance. Additionally, we characterize the major execution bottlenecks of TD methods on the latest Intel Xeon Scalable processors and introduce dynamic adaptation heuristics to automatically select the best algorithm based on the sparse tensor characteristics. Across a diverse set of real-world data sets, ALTO outperforms the state-of-the-art approaches, achieving more than an order-of-magnitude speedup over the best mode-agnostic formats. Compared to the best mode-specific formats, ALTO achieves 5.1X geometric mean speedup at a fraction (25%) of their storage costs.

Paper Structure

This paper contains 29 sections, 3 equations, 14 figures, 1 table, 5 algorithms.

Figures (14)

  • Figure 1: A box plot of the distribution of data (nonzero elements) across the multi-dimensional blocks (subspaces) of the hierarchical coordinate storage li2018hicoo. The multi-dimensional subspace size is 128$^{N}$, where $N$ is the number of dimensions (modes), as per prior work sun2020sptfs. The sparse tensors are sorted in increasing order of their number of nonzero elements. Sparsity is extremely high for many tensors (e.g., nell-1, amazon, and reddit) and varies greatly across tensors.
  • Figure 2: CPD of a mode-$3$ tensor $\mathcal{X}$. There are $R$ rank-one tensors, each formed by the outer product of three vectors $\mathbf a_{r}^{(1)}$, $\mathbf a_{r}^{(2)}$, and $\mathbf a_{r}^{(3)}$, where $r \in \{1,2,\ldots,R\}$. The vectors along the same mode are often grouped together as the columns of a factor matrix. For example, the vectors $\mathbf a_{1}^{(1)}$, $\mathbf a_{2}^{(1)}$, …, $\mathbf a_{R}^{(1)}$ are the columns of the mode-$1$ factor matrix $\mathbf A^{(1)}$.
  • Figure 3: Different sparse tensor representations of a $4\times 8\times 2$ tensor with six nonzero elements.
  • Figure 4: An example of the ALTO sparse encoding and representation for the three-dimensional tensor in Figure 3(a).
  • Figure 5: For the example in Figure 4, ALTO generates a non-fractal, yet more compact encoding compared to traditional space-filling curves, such as the Z-Morton order.
  • ...and 9 more figures
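
The compactness claim in the Figure 5 caption can be illustrated with a minimal Python sketch for the $4\times 8\times 2$ example tensor. It rests on assumptions that are not taken from the paper: each mode receives only as many index bits as its length requires, and bits are interleaved round-robin from the least significant position, skipping modes that have run out of bits. The function names (`build_masks`, `linearize`, `delinearize`, `morton_bits`) are hypothetical, and the exact bit ordering used by ALTO may differ.

```python
def build_masks(dims):
    """Assign each mode a bit mask inside one linearized index.

    Assumption: round-robin interleaving from the least significant bit,
    skipping modes whose bits are exhausted (illustrative, not ALTO's exact order).
    """
    bits = [(d - 1).bit_length() for d in dims]  # bits needed per mode
    masks = [0] * len(dims)
    pos = 0
    for level in range(max(bits, default=0)):
        for mode, nbits in enumerate(bits):
            if level < nbits:
                masks[mode] |= 1 << pos
                pos += 1
    return masks


def linearize(coords, masks):
    """Scatter the bits of every coordinate into its mask positions."""
    index = 0
    for coord, mask in zip(coords, masks):
        bit = 0
        while mask:
            lowest = mask & -mask          # lowest set bit of the mask
            if (coord >> bit) & 1:
                index |= lowest
            mask ^= lowest
            bit += 1
    return index


def delinearize(index, masks):
    """Gather the bits of each mode back out of the linearized index."""
    coords = []
    for mask in masks:
        coord, bit = 0, 0
        while mask:
            lowest = mask & -mask
            if index & lowest:
                coord |= 1 << bit
            mask ^= lowest
            bit += 1
        coords.append(coord)
    return tuple(coords)


def morton_bits(dims):
    """Bits used by a Z-Morton index, which pads every mode to the longest one."""
    return len(dims) * max((d - 1).bit_length() for d in dims)


dims = (4, 8, 2)
masks = build_masks(dims)
adaptive_bits = sum(masks).bit_length()
print(f"adaptive index: {adaptive_bits} bits, Z-Morton index: {morton_bits(dims)} bits")
print(linearize((3, 5, 1), masks), delinearize(linearize((3, 5, 1), masks), masks))
```

Under these assumptions the adaptive index needs 2 + 3 + 1 = 6 bits, whereas a Z-Morton index pads all three modes to 3 bits and needs 9. Because the masks are disjoint, a single linearized index recovers every coordinate, which is what allows one tensor copy to serve all modes.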