Fast Kronecker Matrix-Matrix Multiplication on GPUs

Abhinav Jangda, Mohit Yadav

TL;DR

FastKron introduces a Kron-Matmul algorithm that is decoupled from general linear-algebra primitives, which makes Kron-Matmul-specific optimizations possible. It employs a sliced-multiplication scheme with shift-based shared-memory caching, kernel fusion, and autotuning, achieving speedups of up to 40.7× on a single GPU and 7.85× on 16 GPUs over state-of-the-art baselines. The work extends to distributed Kron-Matmul with a 2D GPU grid to minimize communication, and demonstrates strong gains on real-world Kron-Matmul workloads, including Gaussian-process training when integrated into GPyTorch. The result is a practical, high-performance Kron-Matmul engine for ML, scientific computing, and kernel-method pipelines, with publicly available code.

Abstract

Kronecker Matrix-Matrix Multiplication (Kron-Matmul) is the multiplication of a matrix with the Kronecker product of several smaller matrices. Kron-Matmul is a core operation in many scientific and machine learning computations. State-of-the-art Kron-Matmul implementations are built on existing tensor algebra operations, such as matrix multiplication, transpose, and tensor matrix multiplication. However, this design choice prevents several Kron-Matmul-specific optimizations, leaving significant performance on the table. To address this issue, we present FastKron, an efficient technique for Kron-Matmul on single and multiple GPUs. FastKron is independent of linear algebra operations, which enables several new optimizations for Kron-Matmul. As a result, it performs up to 40.7× and 7.85× faster than existing implementations on 1 and 16 GPUs, respectively.
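
To make the definition concrete, the following is a minimal reference sketch of Kron-Matmul in plain Python: it materializes the full Kronecker product of the factors and then multiplies. This is not FastKron's method; it is the naive baseline that the definition describes. The function names (`kron`, `matmul`, `kron_matmul_naive`) and the sample matrices are illustrative assumptions, not taken from the paper or its code.

```python
from functools import reduce


def kron(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    return [
        [a_val * b_val for a_val in a_row for b_val in b_row]
        for a_row in a
        for b_row in b
    ]


def matmul(x, w):
    """Plain matrix product of two matrices given as lists of rows."""
    w_cols = list(zip(*w))  # columns of w
    return [
        [sum(xv * wv for xv, wv in zip(row, col)) for col in w_cols]
        for row in x
    ]


def kron_matmul_naive(x, factors):
    """Compute X (F1 ⊗ F2 ⊗ ... ⊗ FN) by building the full Kronecker product."""
    return matmul(x, reduce(kron, factors))


# Hypothetical sample data: X is 2x4, both factors are 2x2.
X = [[1, 2, 3, 4], [5, 6, 7, 8]]
F1 = [[1, 2], [3, 4]]
F2 = [[0, 1], [1, 0]]
print(kron_matmul_naive(X, [F1, F2]))  # [[14, 10, 20, 14], [30, 26, 44, 38]]
```

The materialized Kronecker product has as many rows as the product of the factors' row counts and as many columns as the product of their column counts, so this approach becomes impractical as factors are added. Avoiding that materialization is what the existing shuffle-based implementations and FastKron both aim for.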

Paper Structure (26 sections, 1 equation, 11 figures, 5 tables, 2 algorithms)

Figures (11)

  • Figure 1: First iteration of the shuffle algorithm for Kron-Matmul of $\textbf{X}_{2\times 4}$ and $\textbf{F}^{\text{1}}_{2\times 2} \otimes \textbf{F}^{\text{2}}_{2\times 2}$. Reshape changes the shape of a tensor. Transpose exchanges the elements of two dimensions of a multi-dimensional tensor.
  • Figure 2: First iteration of the FastKron Kron-Matmul algorithm of $\textbf{X}_{2\times 4}$ with $\textbf{F}^{\text{1}}_{2\times 2} \otimes \textbf{F}^{\text{2}}_{2\times 2}$. Elements of $\textbf{Y}^{\text{2}}$ with the same color are generated by a column of $\textbf{F}^{\text{2}}$ with the same color. The result of the first iteration, $\textbf{Y}^{\text{2}}$, is the same as in Figure 1.
  • Figure 3: FastKron's SlicedMultiplyKernel for $\textbf{X}_{\text{M}\times \text{K}}$ and $\textbf{F}_{\text{P}\times \text{Q}}$ to compute $\textbf{Y}_{\text{M}\times \frac{\text{K}\text{Q}}{\text{P}}}$. Shift* and Direct* transfer data from global/shared memory to shared memory/registers.
  • Figure 4: FastKron's tiling to sliced-multiply $\textbf{X}_{2\times 512}$ and $\textbf{F}_{8\times 8}$ to produce $\textbf{Y}_{2\times 512}$ with $\mathtt{T_M} = 1, \mathtt{T_K} = 512, \mathtt{T_Q} = 2, \mathtt{T_P} = 4, \mathtt{R_P} = 2, \mathtt{R_Q} = 2,\mathtt{R_K} = 2$. There are $\frac{512}{8} = 64$ slices for each $\textbf{X}$ row. The CUDA kernel is invoked with $\left\{\frac{2}{1}, \frac{512}{512}, \frac{8}{2}\right\}$ threadblocks. Xs and Fs are shared memory buffers. Xr, Fr, and Yr are register buffers.
  • Figure 5: FastKron's shift caching method. ShiftGToS caches from global to shared memory. ShiftSToR caches from shared memory to registers.
  • ...and 6 more figures
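
The sliced multiplication described in the captions of Figures 2 and 3 can be sketched on the CPU as follows. This is a minimal sketch under stated assumptions, not FastKron's CUDA kernel: it assumes each row of X is cut into K/P contiguous slices of length P, that every slice is multiplied with every column of F, and that results produced by the same column of F are stored contiguously, which yields the M × KQ/P output shape given in Figure 3. The factors are applied from last to first. The names `sliced_multiply` and `kron_matmul_sliced` and the sample matrices are hypothetical.

```python
def sliced_multiply(x, f):
    """One iteration: X (M x K) with F (P x Q) gives Y (M x K*Q/P)."""
    p, q = len(f), len(f[0])
    k = len(x[0])
    assert k % p == 0, "row length of X must be a multiple of the rows of F"
    num_slices = k // p
    y = []
    for row in x:
        out = [0] * (num_slices * q)
        for s in range(num_slices):
            x_slice = row[s * p:(s + 1) * p]  # contiguous slice of length P
            for col in range(q):
                # Results from the same column of F are stored contiguously.
                out[col * num_slices + s] = sum(
                    x_slice[i] * f[i][col] for i in range(p)
                )
        y.append(out)
    return y


def kron_matmul_sliced(x, factors):
    """Compute X (F1 ⊗ ... ⊗ FN) without forming the Kronecker product."""
    y = x
    for f in reversed(factors):  # last factor first
        y = sliced_multiply(y, f)
    return y


# Hypothetical sample data: X is 2x4, both factors are 2x2.
X = [[1, 2, 3, 4], [5, 6, 7, 8]]
F1 = [[1, 2], [3, 4]]
F2 = [[0, 1], [1, 0]]
print(sliced_multiply(X, F2))            # [[2, 4, 1, 3], [6, 8, 5, 7]]
print(kron_matmul_sliced(X, [F1, F2]))   # [[14, 10, 20, 14], [30, 26, 44, 38]]
```

Writing each result directly to its final position removes the separate transpose pass that the shuffle algorithm in Figure 1 performs after its matrix multiplication. The GPU-specific parts (tiling, shift caching, kernel fusion) are not modeled here.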