cuFastTuckerPlus: A Stochastic Parallel Sparse FastTucker Decomposition Using GPU Tensor Cores

Zixuan Li, Mingxing Duan, Huizhang Luo, Wangdong Yang, Kenli Li, Keqin Li

TL;DR

This work tackles the challenge of decomposing large-scale sparse high-order tensors by adopting a non-convex SGD-based FastTuckerPlus that splits the optimization into two subproblems, enabling rapid convergence on sparse data. The authors implement cuFastTuckerPlus to exploit fine-grained GPU parallelism and Tensor Cores, with matrixization and memory-access optimizations to minimize overhead. Comprehensive experiments on real (Netflix, Yahoo!Music) and synthetic HHLST datasets show 3X–5X single-iteration speedups and substantial reductions in memory-access overhead, with Tensor Cores providing large acceleration, especially in core-update computations. The proposed approach demonstrates strong performance for high-order, high-dimensional sparse tensor completion and holds practical impact for scalable tensor analysis in data-rich domains.

Abstract

Sparse tensors are prevalent in real-world applications, often characterized by their large-scale, high-order, and high-dimensional nature. Directly handling raw tensors is impractical due to the significant memory and computational overhead involved. The current mainstream approach involves compressing or decomposing the original tensor. One popular tensor decomposition algorithm is the Tucker decomposition. However, existing state-of-the-art algorithms for large-scale Tucker decomposition typically relax the original optimization problem into multiple convex optimization problems to ensure polynomial convergence. Unfortunately, these algorithms tend to converge slowly. In contrast, tensor decomposition exhibits a simple optimization landscape, making local search algorithms capable of converging to a global (approximate) optimum much faster. In this paper, we propose the FastTuckerPlus algorithm, which decomposes the original optimization problem into two non-convex optimization problems and solves them alternately using the Stochastic Gradient Descent method. Furthermore, we introduce cuFastTuckerPlus, a fine-grained parallel algorithm designed for GPU platforms, leveraging the performance of tensor cores. This algorithm minimizes memory access overhead and computational costs, surpassing the state-of-the-art algorithms. Our experimental results demonstrate that our method achieves a speedup of $3X$ to $5X$ compared to state-of-the-art algorithms.
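
The alternating scheme described in the abstract can be illustrated with a small pure-Python sketch. It is not the authors' implementation. It assumes a FastTucker-style model in which each observed entry is approximated by $\sum_{r=1}^{R}\prod_{n=1}^{N}\langle a^{(n)}_{i_n}, b^{(n)}_{:,r}\rangle$, with factor matrices $A^{(n)}$ and core matrices $B^{(n)}$, and it assumes the two subproblems are "all factor matrices" and "all core matrices". The abstract states neither detail, and all function names, sizes, and hyperparameters below are hypothetical.

```python
import random

def init_model(dims, J, R, rng):
    """Random factor matrices A(n) (I_n x J) and core matrices B(n) (J x R)."""
    factors = [[[rng.uniform(0.2, 0.8) for _ in range(J)] for _ in range(I)] for I in dims]
    cores = [[[rng.uniform(0.2, 0.8) for _ in range(R)] for _ in range(J)] for _ in dims]
    return factors, cores

def mode_dots(factors, cores, idx):
    """c[n][r] = <row idx[n] of A(n), column r of B(n)> (assumed model)."""
    return [[sum(A[i][j] * B[j][r] for j in range(len(B))) for r in range(len(B[0]))]
            for A, B, i in zip(factors, cores, idx)]

def product_except(c, r, skip=None):
    """Product of c[m][r] over all modes m, leaving out mode `skip`."""
    p = 1.0
    for m, row in enumerate(c):
        if m != skip:
            p *= row[r]
    return p

def predict(factors, cores, idx):
    c = mode_dots(factors, cores, idx)
    return sum(product_except(c, r) for r in range(len(c[0])))

def sgd_step(factors, cores, idx, x, lr, lam, phase):
    """One SGD step on a single nonzero; `phase` selects the subproblem."""
    c = mode_dots(factors, cores, idx)  # computed once, before any update
    R = len(c[0])
    err = x - sum(product_except(c, r) for r in range(R))
    for n, (A, B, i) in enumerate(zip(factors, cores, idx)):
        d = [product_except(c, r, skip=n) for r in range(R)]
        for j in range(len(B)):
            if phase == "factors":  # update row idx[n] of every A(n), cores fixed
                grad = sum(d[r] * B[j][r] for r in range(R))
                A[i][j] += lr * (err * grad - lam * A[i][j])
            else:                   # update every B(n), factors fixed
                for r in range(R):
                    B[j][r] += lr * (err * d[r] * A[i][j] - lam * B[j][r])

def rmse(factors, cores, samples):
    sq = sum((x - predict(factors, cores, idx)) ** 2 for idx, x in samples)
    return (sq / len(samples)) ** 0.5

def train(factors, cores, samples, epochs=100, lr=0.05, lam=1e-4, seed=0):
    rng = random.Random(seed)
    order = list(samples)
    for _ in range(epochs):
        for phase in ("factors", "cores"):  # alternate the two subproblems
            rng.shuffle(order)
            for idx, x in order:
                sgd_step(factors, cores, idx, x, lr, lam, phase)

def demo(seed=0):
    """Fit 60 observed entries of a synthetic 6 x 5 x 4 tensor (toy sizes)."""
    dims, J, R = (6, 5, 4), 2, 2
    truth = init_model(dims, J, R, random.Random(seed + 1))
    cells = [(i, j, k) for i in range(6) for j in range(5) for k in range(4)]
    observed = random.Random(seed).sample(cells, 60)
    samples = [(idx, predict(*truth, idx)) for idx in observed]
    factors, cores = init_model(dims, J, R, random.Random(seed + 2))
    before = rmse(factors, cores, samples)
    train(factors, cores, samples)
    return before, rmse(factors, cores, samples)

if __name__ == "__main__":
    print("training RMSE before/after:", demo())
```

Each step touches only the parameters indexed by one nonzero, which is what makes this kind of update attractive for sparse data. The sketch is serial and does not use GPU parallelism or Tensor Cores.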

Paper Structure (25 sections, 19 equations, 5 figures, 10 tables, 5 algorithms)

This paper contains 25 sections, 19 equations, 5 figures, 10 tables, 5 algorithms.

Figures (5)

  • Figure 1: The convergence curves of cuFastTuckerPlus and other algorithms on the Netflix and Yahoo!Music datasets.
  • Figure 2: The running time (in seconds) of cuFastTuckerPlus and other algorithms on the synthetic datasets.
  • Figure 3: The memory access time (in seconds) of cuFastTuckerPlus and other algorithms on the synthetic datasets.
  • Figure 4: The speedup achieved by cuFastTuckerPlus and other algorithms when using Tensor Cores on the synthetic datasets.
  • Figure 5: The running time (in seconds) of cuFastTuckerPlus_CC and cuFastTuckerPlus under various strategies on the synthetic datasets.

Theorems & Definitions (5)

  • Definition 1: $n$-Mode Tensor-Matrix product
  • Definition 2: $R$ Kruskal Product
  • Definition 3: $R$ Dot Product
  • Definition 4: Hadamard Product
  • Definition 5: $R$ Hadamard Product
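
Two of the listed operations have widely used textbook forms, which can be sketched as follows. The sketch assumes the common convention $(\mathcal{X} \times_n U)_{i_1 \ldots j \ldots i_N} = \sum_{i_n} x_{i_1 \ldots i_n \ldots i_N}\, u_{j i_n}$ for the $n$-mode product and element-wise multiplication for the Hadamard product. The paper's own definitions may differ in notation, and the $R$-variants (Definitions 2, 3, and 5) are paper-specific and are not reproduced here. The sparse tensor is stored as a hypothetical dictionary from index tuples to values, and `n` is 0-based.

```python
from collections import defaultdict

def mode_n_product(tensor, matrix, n):
    """n-mode product of a sparse tensor {index tuple: value} with a
    matrix given as a list of rows of shape (J, I_n); n is 0-based."""
    out = defaultdict(float)
    for idx, val in tensor.items():
        for j, row in enumerate(matrix):
            # Replace the mode-n index by j and accumulate val * U[j][i_n].
            out[idx[:n] + (j,) + idx[n + 1:]] += val * row[idx[n]]
    return dict(out)

def hadamard(a, b):
    """Element-wise product of two matrices of the same shape."""
    return [[x * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

# Usage: a 2 x 2 x 2 sparse tensor with three nonzeros, multiplied along mode 1
X = {(0, 0, 0): 1.0, (0, 1, 0): 2.0, (1, 1, 1): 3.0}
U = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # shape (3, 2)
Y = mode_n_product(X, U, 1)               # result has shape 2 x 3 x 2
H = hadamard([[1, 2], [3, 4]], [[5, 6], [7, 8]])
```

Only the nonzeros of the input are visited, so the cost scales with the number of stored entries times the number of matrix rows rather than with the full tensor size.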