cuFastTuckerPlus: A Stochastic Parallel Sparse FastTucker Decomposition Using GPU Tensor Cores

Zixuan Li; Mingxing Duan; Huizhang Luo; Wangdong Yang; Kenli Li; Keqin Li

TL;DR

This work addresses the decomposition of large-scale, sparse, high-order tensors. It proposes FastTuckerPlus, a non-convex SGD-based algorithm that splits the optimization into two subproblems and converges quickly on sparse data. The GPU implementation, cuFastTuckerPlus, exploits fine-grained parallelism and Tensor Cores, and uses matrixization and memory-access optimizations to reduce overhead. Experiments on real datasets (Netflix, Yahoo!Music) and synthetic HHLST datasets show $3\times$ to $5\times$ single-iteration speedups and substantial reductions in memory-access overhead; Tensor Cores provide large acceleration, especially in the core-update computations. The approach performs well for high-order, high-dimensional sparse tensor completion and is practical for scalable tensor analysis in data-rich domains.

Abstract

Sparse tensors are prevalent in real-world applications and are often large-scale, high-order, and high-dimensional. Handling the raw tensors directly is impractical because of the memory and computational overhead involved, so the mainstream approach is to compress or decompose the original tensor. One popular tensor decomposition algorithm is the Tucker decomposition. However, existing state-of-the-art algorithms for large-scale Tucker decomposition typically relax the original optimization problem into multiple convex optimization problems to ensure polynomial convergence, and these algorithms tend to converge slowly. In contrast, tensor decomposition exhibits a simple optimization landscape, so local search algorithms can converge to a global (approximate) optimum much faster. In this paper, we propose the FastTuckerPlus algorithm, which decomposes the original optimization problem into two non-convex optimization problems and solves them alternately using the Stochastic Gradient Descent method. Furthermore, we introduce cuFastTuckerPlus, a fine-grained parallel algorithm designed for GPU platforms that leverages the performance of Tensor Cores. This algorithm minimizes memory access overhead and computational costs, outperforming state-of-the-art algorithms. Our experimental results demonstrate that our method achieves a speedup of $3\times$ to $5\times$ over state-of-the-art algorithms.
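The alternating scheme described in the abstract can be illustrated with a small CPU-only sketch. It is not the authors' implementation. It assumes a FastTucker-style model in which the core tensor is a sum of $R$ rank-one terms, so that each entry is predicted as $\hat{x}_{i_1 \dots i_N} = \sum_{r=1}^{R} \prod_{n=1}^{N} \langle a^{(n)}_{i_n}, b^{(n)}_{:,r} \rangle$, and it assumes the two subproblems are "update all factor matrices" and "update all core matrices". The function names, the toy tensor sizes, the learning rate, and the regularization weight are all hypothetical choices made for illustration.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def inner(idx, A, B):
    # c[n][r] = <row i_n of factor matrix A[n], column r of core matrix B[n]>
    return [[dot(A[n][i], col) for col in B[n]] for n, i in enumerate(idx)]

def prod_except(c, r, skip=None):
    # Product over modes of c[m][r], optionally leaving out mode `skip`.
    p = 1.0
    for m in range(len(c)):
        if m != skip:
            p *= c[m][r]
    return p

def predict(idx, A, B):
    c = inner(idx, A, B)
    return sum(prod_except(c, r) for r in range(len(B[0])))

def sgd_pass(entries, A, B, lr, lam, target, rng):
    # One stochastic pass over the observed entries only.
    # target == "factor": update rows of A with B fixed (assumed subproblem 1).
    # target == "core":   update columns of B with A fixed (assumed subproblem 2).
    for idx, x in rng.sample(entries, len(entries)):
        c = inner(idx, A, B)
        R = len(B[0])
        err = sum(prod_except(c, r) for r in range(R)) - x
        for n, i in enumerate(idx):
            w = [prod_except(c, r, skip=n) for r in range(R)]
            row = A[n][i]
            if target == "factor":
                g = [err * sum(w[r] * B[n][r][j] for r in range(R)) + lam * row[j]
                     for j in range(len(row))]
                A[n][i] = [a - lr * gj for a, gj in zip(row, g)]
            else:
                for r in range(R):
                    B[n][r] = [b - lr * (err * w[r] * a + lam * b)
                               for a, b in zip(row, B[n][r])]

def rmse(entries, A, B):
    sq = sum((predict(idx, A, B) - x) ** 2 for idx, x in entries)
    return (sq / len(entries)) ** 0.5

def init(dims, J, R, rng):
    # A[n] has dims[n] rows of length J; B[n] stores R columns of length J.
    A = [[[rng.uniform(0.3, 0.7) for _ in range(J)] for _ in range(I)] for I in dims]
    B = [[[rng.uniform(0.3, 0.7) for _ in range(J)] for _ in range(R)] for _ in dims]
    return A, B

rng = random.Random(0)
dims, J, R = (6, 5, 4), 3, 2                      # toy sizes, chosen arbitrarily
A_true, B_true = init(dims, J, R, rng)
all_idx = [(i, j, k) for i in range(6) for j in range(5) for k in range(4)]
observed = [(idx, predict(idx, A_true, B_true)) for idx in rng.sample(all_idx, 60)]

A, B = init(dims, J, R, rng)
rmse_before = rmse(observed, A, B)
for _ in range(50):                               # alternate the two subproblems
    sgd_pass(observed, A, B, 0.02, 1e-4, "factor", rng)
    sgd_pass(observed, A, B, 0.02, 1e-4, "core", rng)
rmse_after = rmse(observed, A, B)
print(rmse_before, rmse_after)
```

Each update touches only the rows and columns indexed by one observed entry, which is what makes the per-sample work independent of the tensor's full size. The GPU-specific parts of cuFastTuckerPlus (matrixization, Tensor Core usage, warp and block parallelization) are not modeled here.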

Paper Structure (25 sections, 19 equations, 5 figures, 10 tables, 5 algorithms)

Introduction
Preliminaries
Notations
Basic Definitions
Problem
SGD-based Sparse FastTucker Decomposition Algorithm
Proposed Method
A Non-Convex SGD-based Sparse FastTucker Decomposition Algorithm
Matrixization
Complexity Analysis
cuFastTuckerPlus On GPU
Tensor Core
Matrix partition
Warp Parallelization
Block Parallelization
...and 10 more sections

Figures (5)

Figure 1: The convergence curves of cuFastTuckerPlus and other algorithms on Netflix and Yahoo!Music datasets.
Figure 2: The running time (in seconds) of cuFastTuckerPlus and other algorithms on synthetic datasets.
Figure 3: The memory access time (in seconds) of cuFastTuckerPlus and other algorithms on synthetic datasets.
Figure 4: The speedup achieved by cuFastTuckerPlus and other algorithms when utilizing Tensor Cores on the synthetic datasets.
Figure 5: The running time (in seconds) of cuFastTuckerPlus_CC and cuFastTuckerPlus under various strategies on the synthetic datasets.

Theorems & Definitions (5)

Definition 1: $n$-Mode Tensor-Matrix product
Definition 2: $R$ Kruskal Product
Definition 3: $R$ Dot Product
Definition 4: Hadamard Product
Definition 5: $R$ Hadamard Product
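
Two of the listed operations have widely used textbook forms, which the following sketch illustrates on tiny inputs. It assumes the common convention $(\mathcal{X} \times_n U)_{i_1 \dots j \dots i_N} = \sum_{i_n} x_{i_1 \dots i_n \dots i_N} \, u_{j, i_n}$ for the $n$-mode tensor-matrix product and the elementwise definition of the Hadamard product; the paper's own definitions may differ in indexing or orientation. The $R$ Kruskal, $R$ dot, and $R$ Hadamard products are specific to the paper and are not reproduced here. The function names and the sparse dictionary representation are hypothetical.

```python
from collections import defaultdict

def mode_n_product(X, U, n):
    # X: sparse tensor as {index tuple: value}; U: matrix as a list of rows,
    # where each row has length equal to the size of mode n (0-based).
    Y = defaultdict(float)
    for idx, x in X.items():
        for j, row in enumerate(U):
            out = idx[:n] + (j,) + idx[n + 1:]
            Y[out] += x * row[idx[n]]
    return dict(Y)

def hadamard(P, Q):
    # Elementwise product of two matrices with the same shape.
    return [[p * q for p, q in zip(rp, rq)] for rp, rq in zip(P, Q)]

# A 2 x 3 x 2 sparse tensor with three nonzeros.
X = {(0, 0, 0): 1.0, (1, 0, 1): 2.0, (1, 2, 1): 3.0}
# A 2 x 3 matrix applied along mode 1, so the result has shape 2 x 2 x 2.
U = [[1.0, 0.0, 1.0],
     [0.0, 2.0, 0.0]]
Y = mode_n_product(X, U, 1)
H = hadamard([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(Y[(1, 0, 1)], H)
```

Iterating over the stored nonzeros rather than over every index is the usual reason sparse formats are preferred for tensors of this kind.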
