cuFastTuckerPlus: A Stochastic Parallel Sparse FastTucker Decomposition Using GPU Tensor Cores
Zixuan Li, Mingxing Duan, Huizhang Luo, Wangdong Yang, Kenli Li, Keqin Li
TL;DR
This work addresses the decomposition of large-scale, sparse, high-order tensors with FastTuckerPlus, a non-convex SGD-based method that splits the optimization into two subproblems and converges rapidly on sparse data. The authors implement it as cuFastTuckerPlus, which exploits fine-grained GPU parallelism and Tensor Cores and uses matrixization and memory-access optimizations to minimize overhead. Experiments on real (Netflix, Yahoo!Music) and synthetic HHLST datasets show 3X–5X single-iteration speedups and substantial reductions in memory-access overhead; Tensor Cores provide large acceleration, especially in core-update computations. The approach performs well on high-order, high-dimensional sparse tensor completion and is relevant to scalable tensor analysis in data-rich domains.
Abstract
Sparse tensors are prevalent in real-world applications and are often large-scale, high-order, and high-dimensional. Handling the raw tensors directly is impractical because of the memory and computational overhead involved, so the mainstream approach is to compress or decompose the original tensor. One popular tensor decomposition algorithm is the Tucker decomposition. However, existing state-of-the-art algorithms for large-scale Tucker decomposition typically relax the original optimization problem into multiple convex optimization problems to ensure polynomial convergence, and these algorithms tend to converge slowly. Tensor decomposition, by contrast, exhibits a simple optimization landscape, so local search algorithms can converge to an (approximately) global optimum much faster. In this paper, we propose the FastTuckerPlus algorithm, which decomposes the original optimization problem into two non-convex optimization problems and solves them alternately using Stochastic Gradient Descent. Furthermore, we introduce cuFastTuckerPlus, a fine-grained parallel algorithm designed for GPU platforms that leverages the performance of Tensor Cores. This algorithm minimizes memory-access overhead and computational cost and outperforms state-of-the-art algorithms. Our experimental results show that our method achieves a speedup of $3\times$ to $5\times$ over state-of-the-art algorithms.
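The alternating scheme described in the abstract can be illustrated with a small CPU-only sketch. It is not the authors' implementation and uses no GPU or Tensor Core code. It assumes a Kruskal-structured core, so that each entry is modeled as $\hat{x}_{i_1 \dots i_N} = \sum_{r=1}^{R} \prod_{n=1}^{N} \langle a^{(n)}_{i_n}, b^{(n)}_{:,r} \rangle$ with factor matrices $A^{(n)}$ and core matrices $B^{(n)}$; this model, the function names, the regularization term, and all hyperparameters are assumptions made for illustration. One SGD pass updates the factor matrices with the core matrices fixed, and the next updates the core matrices with the factor matrices fixed.

```python
import math
import random

def mode_dots(idx, A, B):
    """c[n][r] = <row idx[n] of A[n], column r of B[n]>."""
    return [[sum(a * b_row[r] for a, b_row in zip(A[n][i], B[n]))
             for r in range(len(B[n][0]))]
            for n, i in enumerate(idx)]

def prod_except(c, skip, r):
    """Product over modes m != skip of c[m][r]; skip=None keeps every mode."""
    p = 1.0
    for m, row in enumerate(c):
        if m != skip:
            p *= row[r]
    return p

def predict(idx, A, B):
    """Model value at index tuple idx under the assumed Kruskal-core model."""
    c = mode_dots(idx, A, B)
    return sum(prod_except(c, None, r) for r in range(len(c[0])))

def rmse(entries, A, B):
    return math.sqrt(sum((x - predict(idx, A, B)) ** 2
                         for idx, x in entries) / len(entries))

def sgd_pass(entries, A, B, lr, lam, update, rng):
    """One SGD pass over the observed entries.

    update == "factor": step the touched rows of every A[n], B held fixed.
    update == "core":   step every B[n], A held fixed.
    Gradients for all modes use the dot products c computed before the step.
    """
    for idx, x in rng.sample(entries, len(entries)):  # shuffled visiting order
        c = mode_dots(idx, A, B)
        R = len(c[0])
        err = x - sum(prod_except(c, None, r) for r in range(R))
        for n, i in enumerate(idx):
            others = [prod_except(c, n, r) for r in range(R)]
            for j in range(len(B[n])):
                if update == "factor":
                    g = sum(others[r] * B[n][j][r] for r in range(R))
                    A[n][i][j] += lr * (err * g - lam * A[n][i][j])
                else:
                    a = A[n][i][j]
                    for r in range(R):
                        B[n][j][r] += lr * (err * others[r] * a - lam * B[n][j][r])

def init(dims, J, R, rng):
    """Random factor matrices (I_n x J) and core matrices (J x R), one per mode."""
    A = [[[rng.uniform(0.3, 0.9) for _ in range(J)] for _ in range(I)] for I in dims]
    B = [[[rng.uniform(0.3, 0.9) for _ in range(R)] for _ in range(J)] for _ in dims]
    return A, B

def demo(seed=0, dims=(4, 4, 4), J=2, R=2, epochs=30):
    """Fit a synthetic sparse tensor; returns (RMSE before, RMSE after)."""
    rng = random.Random(seed)
    A_true, B_true = init(dims, J, R, rng)
    cells = [(i, j, k) for i in range(dims[0])
             for j in range(dims[1]) for k in range(dims[2])]
    # Observe 40 of the 64 cells; the rest are treated as missing.
    entries = [(idx, predict(idx, A_true, B_true)) for idx in rng.sample(cells, 40)]
    A, B = init(dims, J, R, rng)
    before = rmse(entries, A, B)
    for _ in range(epochs):
        sgd_pass(entries, A, B, 0.01, 1e-3, "factor", rng)  # subproblem 1
        sgd_pass(entries, A, B, 0.01, 1e-3, "core", rng)    # subproblem 2
    return before, rmse(entries, A, B)

before, after = demo()
print(f"training RMSE before: {before:.4f}, after: {after:.4f}")
```

The sketch processes one nonzero at a time in plain Python, so it only conveys the structure of the two subproblems; the fine-grained parallelism, matrixization, and memory-access optimizations of cuFastTuckerPlus are outside its scope.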
