A Parallel Scan Algorithm in the Tensor Core Unit Model

Anastasios Zouzias, William F. McColl

TL;DR

A parallel scan (prefix sum) algorithm in the Tensor Core Unit (TCU) model of computation that performs $O(n/s^2)$ multiplications of square matrices of size $s$ and has depth at most $2\lfloor \log_s (n)\rfloor$ for inputs of size $n$.

Abstract

We present a parallel scan (prefix sum) algorithm in the Tensor Core Unit (TCU) model of computation. The TCU model assumes that multiplication between two square matrices of constant size $s$ is a basic operation. In the $(s^2, \ell)$-TCU model, we show that for inputs of size $n$, the algorithm has depth at most $2\lfloor \log_s (n)\rfloor$ and runs in $O(n(1 + \ell /s^2)/p + (s^2 + \ell) \log_s (n))$ time assuming $p$ tensor core units. Equivalently, the algorithm performs $O(n/s^2)$ multiplications of square matrices of size $s$.
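
To make the idea of computing a scan through matrix multiplication concrete, the following is a minimal sketch, not the paper's algorithm. It assumes the input length is a power of $s$, views the input as rows of length $s$, and multiplies them by an $s \times s$ upper-triangular all-ones matrix to obtain per-row prefix sums. It then recurses on the row totals and adds the resulting offsets back. The names `matmul`, `upper_ones`, and `block_scan` are hypothetical, the plain-Python `matmul` merely stands in for a tensor core multiplication, and the final offset addition is done with scalar additions for simplicity.

```python
from itertools import accumulate


def matmul(a, b):
    """Plain matrix product; stands in for a tensor core multiplication."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def upper_ones(s):
    """s x s matrix with ones on and above the diagonal."""
    return [[1 if i <= j else 0 for j in range(s)] for i in range(s)]


def block_scan(x, s):
    """Inclusive prefix sum of x; len(x) is assumed to be a power of s."""
    n = len(x)
    if n == 1:
        return list(x)
    assert n % s == 0, "sketch assumes len(x) is a power of s"

    # View x as n/s rows of length s. Right-multiplying a row by the
    # upper-triangular ones matrix yields that row's prefix sums.
    rows = [x[i:i + s] for i in range(0, n, s)]
    local = matmul(rows, upper_ones(s))

    # The last entry of each scanned row is the row total. Scanning the
    # totals recursively gives the offset each later row must add.
    totals = [row[-1] for row in local]
    offsets = block_scan(totals, s)

    out = []
    for r, row in enumerate(local):
        carry = offsets[r - 1] if r > 0 else 0
        out.extend(v + carry for v in row)
    return out


if __name__ == "__main__":
    x = list(range(1, 17))
    print(block_scan(x, 4))
    print(block_scan(x, 4) == list(accumulate(x)))
```

Each recursion level shrinks the problem by a factor of $s$, so the number of levels grows like $\log_s(n)$, which is the quantity that appears in the depth bound above.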

Paper Structure

This paper contains 14 sections, 3 theorems, 6 equations, 3 figures, 1 table, 2 algorithms.

Key Result

lemma 1

Fix an integer $s\geq 2$. Let $\bm{x}$ be a vector of size $n=s^k$ for some $k$. Algorithm alg:tcu_scan has depth $2k-1$ in the TCU model and performs at most $\lceil \frac{2n}{s(s-1)}\rceil +2k-2$ matrix multiplications. Moreover, the number of scalar binary additions executed by Algorithm alg:tcu_
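
As a quick illustration of the depth and multiplication-count bounds stated in Lemma 1, the short sketch below evaluates $2k-1$ and $\lceil \frac{2n}{s(s-1)}\rceil + 2k - 2$ for given $s$ and $k$. The helper name `lemma1_bounds` is hypothetical and only evaluates the stated formulas; it does not run the algorithm.

```python
def lemma1_bounds(s, k):
    """Return (n, depth, max_matmuls) from the Lemma 1 formulas for n = s**k."""
    assert s >= 2 and k >= 1
    n = s ** k
    depth = 2 * k - 1
    # Integer ceiling of 2n / (s(s-1)), avoiding floating point.
    max_matmuls = -(-2 * n // (s * (s - 1))) + 2 * k - 2
    return n, depth, max_matmuls


if __name__ == "__main__":
    print(lemma1_bounds(4, 2))  # (16, 3, 5)
    print(lemma1_bounds(2, 4))  # (16, 7, 22)
```

For the same input size $n = 16$, the larger matrix size $s = 4$ gives both a smaller depth and a smaller multiplication bound than $s = 2$.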

Figures (3)

  • Figure 1: Examples of Algorithm alg:tcu_scan for input $\bm{x}=[1,2,\dots , 16]$.
  • Figure 2: Execution of BatchMatMul$(\bm{y}=[1;2;\dots ;16], \bm{A}_2=\bm{L}_2)$.
  • Figure 3: Execution diagram of the general case leveraging Algorithm alg:tcu_scan as a building block. The diagram demonstrates that after the up-sweep phase of the first largest chunks of size $s^{k_1}$, the prefix sum computation of the maximum values of each chunk (excluding the first one) can be interleaved with the down-sweep computation of the largest chunks.

Theorems & Definitions (5)

  • lemma 1
  • proof
  • theorem 1
  • proof
  • corollary 1