Fast Kronecker Matrix-Matrix Multiplication on GPUs
Abhinav Jangda, Mohit Yadav
TL;DR
FastKron introduces an algorithm for Kronecker Matrix-Matrix Multiplication (Kron-Matmul) that does not rely on general linear-algebra primitives, which makes Kron-Matmul-specific optimizations possible. It uses a sliced-multiplication scheme with shift-based shared-memory caching, kernel fusion, and autotuning, and achieves speedups of up to 40.7× on a single GPU and 7.85× on 16 GPUs over state-of-the-art baselines. The work also extends to distributed Kron-Matmul, arranging the GPUs in a 2D grid to minimize communication, and reports gains on real-world Kron-Matmul workloads, including Gaussian-process training when integrated into GPyTorch. The result is a practical, high-performance Kron-Matmul engine for machine learning, scientific computing, and kernel-method pipelines, with publicly available code.
Abstract
Kronecker Matrix-Matrix Multiplication (Kron-Matmul) is the multiplication of a matrix with the Kronecker product of several smaller matrices. Kron-Matmul is a core operation in many scientific and machine learning computations. State-of-the-art Kron-Matmul implementations are built from existing tensor algebra operations, such as matrix multiplication, transpose, and tensor-matrix multiplication. However, this design choice rules out several Kron-Matmul-specific optimizations, leaving significant performance on the table. To address this issue, we present FastKron, an efficient technique for Kron-Matmul on single and multiple GPUs. FastKron is independent of linear algebra operations, which enables several new optimizations for Kron-Matmul. As a result, it performs up to 40.7× and 7.85× faster than existing implementations on 1 and 16 GPUs, respectively.
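To make the definition concrete, the sketch below computes a Kron-Matmul in two ways in plain Python: by materializing the full Kronecker product, and by composing reshape, matrix-multiply, and transpose steps, which is the kind of construction from general primitives that the abstract attributes to existing implementations. It is not FastKron's algorithm. The function names (`kron_matmul_naive`, `kron_matmul_reshape`) and the list-of-lists matrix layout are assumptions made for illustration only.

```python
def matmul(a, b):
    """Dense product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def kron(a, b):
    """Kronecker product of two matrices stored as lists of rows."""
    return [[x * y for x in row_a for y in row_b]
            for row_a in a for row_b in b]


def kron_matmul_naive(x, factors):
    """Compute X (F1 kron F2 kron ... kron FN) by building the full product."""
    full = factors[0]
    for f in factors[1:]:
        full = kron(full, f)
    return matmul(x, full)


def kron_matmul_reshape(x, factors):
    """Same result without materializing the Kronecker product.

    For each factor, last to first: reshape, multiply, then transpose.
    """
    m = len(x)
    y = x
    for f in reversed(factors):
        p, q = len(f), len(f[0])
        slices = len(y[0]) // p
        # Reshape (m, slices * p) -> (m * slices, p): one row per slice.
        rows = [row[s * p:(s + 1) * p] for row in y for s in range(slices)]
        z = matmul(rows, f)  # shape (m * slices, q)
        # View z as (m, slices, q), swap the last two axes, and flatten
        # back to (m, q * slices) so the next factor sees contiguous slices.
        y = [[z[i * slices + s][j] for j in range(q) for s in range(slices)]
             for i in range(m)]
    return y


# Small demo: X is 1 x 4, and each of the two factors is 2 x 2.
x = [[1, 2, 3, 4]]
factors = [[[1, 2], [3, 4]], [[0, 1], [1, 0]]]
print(kron_matmul_naive(x, factors) == kron_matmul_reshape(x, factors))
```

The second version never forms the large Kronecker matrix, but every factor still costs a separate multiply followed by a transpose of the intermediate result. That per-factor data movement is the overhead an algorithm written specifically for Kron-Matmul can try to avoid.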
