Sparse MTTKRP Acceleration for Tensor Decomposition on GPU

Sasindu Wijeratne; Rajgopal Kannan; Viktor Prasanna

TL;DR

This work targets spMTTKRP, the bottleneck kernel of sparse Canonical Polyadic Decomposition (CPD), with a GPU algorithm that eliminates global atomic operations across thread blocks, keeps intermediate results out of global memory, and balances load across thread blocks through a hypergraph-informed tensor partitioning scheme. Its key mechanism is dynamic tensor remapping built on a modified FLYCOO tensor format: each nonzero element can be remapped independently, so the same optimizations apply in every mode without cross-block atomics. Computations are mapped to GPU thread blocks with configurations chosen to maximize SM throughput and L1 cache reuse. Across widely used datasets, the method achieves geometric mean speedups of 1.5x, 2.0x, and 21.7x in total execution time over state-of-the-art GPU implementations, and it supports tensors with more than four modes.

Abstract

Sparse Matricized Tensor Times Khatri-Rao Product (spMTTKRP) is the bottleneck kernel of sparse tensor decomposition. In this work, we propose a GPU-based algorithm design that addresses the key challenges in accelerating spMTTKRP computation: (1) eliminating global atomic operations across GPU thread blocks, (2) avoiding the communication of intermediate values between GPU thread blocks and GPU global memory, and (3) ensuring a balanced distribution of workloads across GPU thread blocks. Our approach also supports dynamic tensor remapping, which enables these optimizations in all modes of the input tensor. It achieves geometric mean speedups of 1.5x, 2.0x, and 21.7x in total execution time across widely used datasets compared with state-of-the-art GPU implementations. Ours is the only GPU implementation that supports tensors with more than four modes, since state-of-the-art works have implementation constraints for tensors with a large number of modes.
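
To make the atomics challenge concrete, the following is a minimal CPU-side sketch in plain Python, not the authors' GPU implementation. It assumes a 3-mode tensor stored as a list of `(i, j, k, value)` nonzeros and dense factor matrices `B` and `C` given as nested lists. The function names and the greedy balancing heuristic are hypothetical and only illustrate the idea that grouping nonzeros by output row lets each partition (standing in for a thread block) write a disjoint set of rows; the paper's actual scheme relies on FLYCOO-based remapping and is not reproduced here.

```python
from collections import defaultdict


def mttkrp_mode0(nonzeros, B, C, num_rows, rank):
    """Reference mode-0 spMTTKRP for a 3-mode tensor in COO form."""
    out = [[0.0] * rank for _ in range(num_rows)]
    for i, j, k, val in nonzeros:
        for r in range(rank):
            # Several nonzeros can share the same i. If they were handled by
            # different GPU thread blocks, this update would need an atomic add.
            out[i][r] += val * B[j][r] * C[k][r]
    return out


def partition_by_output_row(nonzeros, num_parts):
    """Hypothetical greedy partitioner: keep each output row in one partition."""
    by_row = defaultdict(list)
    for nz in nonzeros:
        by_row[nz[0]].append(nz)
    parts = [[] for _ in range(num_parts)]
    # Place the heaviest rows first, each into the currently lightest partition.
    for row in sorted(by_row, key=lambda i: (-len(by_row[i]), i)):
        lightest = min(range(num_parts), key=lambda p: len(parts[p]))
        parts[lightest].extend(by_row[row])
    return parts


def mttkrp_mode0_partitioned(nonzeros, B, C, num_rows, rank, num_parts):
    """Same result as mttkrp_mode0, computed one partition at a time."""
    out = [[0.0] * rank for _ in range(num_rows)]
    for part in partition_by_output_row(nonzeros, num_parts):
        # Each partition touches rows that no other partition touches,
        # so no cross-partition synchronization would be required.
        for i, j, k, val in part:
            for r in range(rank):
                out[i][r] += val * B[j][r] * C[k][r]
    return out
```

Because the partitioning above is specific to mode 0, a different grouping would be needed for each output mode. This is the gap that the dynamic tensor remapping described in the abstract is meant to close.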
