DGEMM on Integer Matrix Multiplication Unit

Hiroyuki Ootomo, Katsuhisa Ozaki, Rio Yokota

TL;DR

This work investigates using integer matrix multiplication units (IMMUs) to perform high-precision GEMM via the Ozaki scheme. It develops a DGEMM approach on IMMU (INT8 inputs with INT32 accumulators), analyzes theoretical advantages (denser mantissa per slice, fewer splits, smaller slice memory, fewer GEMMs), and implements an NVIDIA IMMU-based Ozaki variant. Through extensive experiments across GPUs, it shows competitive accuracy and significant throughput gains over FP16-based baselines, plus up to 4.33x speedups in quantum circuit simulations while preserving FP64 accuracy. The results highlight the practical potential of IMMU-based Ozaki GEMM for HPC workloads and provide a pathway to high-precision computing on hardware optimized for integer arithmetic.

Abstract

Deep learning hardware achieves high throughput and low power consumption by reducing computing precision and specializing in matrix multiplication. For machine learning inference, fixed-point value computation is commonplace, where the input and output values and the model parameters are quantized. Thus, many processors are now equipped with fast integer matrix multiplication units (IMMUs). It is of significant interest to find a way to harness these IMMUs to improve the performance of HPC applications while maintaining accuracy. We focus on the Ozaki scheme, which computes a high-precision matrix multiplication by using lower-precision computing units, and show the advantages and disadvantages of using IMMUs. The experiment using integer Tensor Cores shows that we can compute double-precision matrix multiplication faster than cuBLAS and an existing Ozaki scheme implementation on FP16 Tensor Cores on NVIDIA consumer GPUs. Furthermore, we demonstrate accelerating a quantum circuit simulation by up to 4.33x while maintaining the FP64 accuracy.
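As a rough illustration of the idea in the abstract, the pure-Python sketch below emulates an Ozaki-style product built from small integer slices. Each row of A and each column of B shares one exponent, every value is cut into signed slices that fit in INT8, each slice pair is multiplied with integer arithmetic that stays within INT32 range, and the partial results are recombined. The names `split_rows` and `ozaki_int_matmul`, the defaults `num_split=8` and `bps=7`, and the exact recombination in Python integers are assumptions made for this example. They are not taken from the paper's implementation, which runs on integer Tensor Cores.

```python
import math

def split_rows(rows, num_split, bps):
    """Split each row into `num_split` signed integer slices of `bps` bits.

    All elements of a row share one exponent e, so that
    x ~= (sum_p slice_p * 2**(bps*(num_split-1-p))) * 2**(e - num_split*bps).
    """
    total = num_split * bps
    mask = (1 << bps) - 1
    exps = []
    slices = [[] for _ in range(num_split)]
    for row in rows:
        amax = max(abs(x) for x in row)
        e = math.frexp(amax)[1] if amax else 0      # amax < 2**e
        exps.append(e)
        # Fixed-point value relative to the shared exponent, truncated toward zero.
        fixed = [int(math.ldexp(x, total - e)) for x in row]
        for p in range(num_split):
            shift = bps * (num_split - 1 - p)
            slices[p].append(
                [(-1 if q < 0 else 1) * ((abs(q) >> shift) & mask) for q in fixed]
            )
    return exps, slices

def ozaki_int_matmul(A, B, num_split=8, bps=7):
    """Emulated high-precision C = A @ B using INT8-sized slices (illustrative only)."""
    m, k, n = len(A), len(B), len(B[0])
    # Slices must fit in INT8 and every slice dot product must fit in INT32.
    assert bps <= 7 and k * ((1 << bps) - 1) ** 2 < 2 ** 31
    total = num_split * bps
    ea, sa = split_rows(A, num_split, bps)
    eb, sb = split_rows([list(col) for col in zip(*B)], num_split, bps)
    C = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = 0  # exact Python integer; a simplification for this sketch
            for p in range(num_split):
                for q in range(num_split):
                    # One entry of the integer GEMM for slice pair (p, q).
                    dot = sum(x * y for x, y in zip(sa[p][i], sb[q][j]))
                    acc += dot << (bps * (2 * (num_split - 1) - p - q))
            C[i][j] = math.ldexp(float(acc), ea[i] + eb[j] - 2 * total)
    return C

if __name__ == "__main__":
    print(ozaki_int_matmul([[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]))
```

With 8 slices of 7 bits, each operand keeps 56 bits relative to its row or column maximum, which is why the result can match an FP64 reference closely. A GPU implementation would run each slice pair as one integer GEMM instead of the per-element loops used here.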

Paper Structure

This paper contains 31 sections, 7 equations, 10 figures, 3 tables, 4 algorithms.

Figures (10)

  • Figure 1: The basic computation in algorithms for high-precision matrix multiplication $\mathbf{C}\leftarrow\mathbf{A}\cdot\mathbf{B}$ on low-precision computing units.
  • Figure 2: Comparison of the elementwise-place splitting (left) and shared-place splitting (right) methods for splitting the vector $\mathbf{a}=[a_1\; a_2\; \cdots\; a_k]^\top$ into several vectors $\mathbf{a}^{(\cdot)}$. $e_i$ is the exponent of $a_i$. The same applies to the vector $\mathbf{b}$.
  • Figure 3: The difference in how data is stored in Algorithm \ref{alg:ozaki-fp-splitting} (the original algorithm, using floating point) and Algorithm \ref{alg:ozaki-int-splitting} (using integers).
  • Figure 4: Comparison of memory consumption and the number of GEMM operations across matrix multiplication units. Upper left: the number of mantissa bits that one slice value keeps (BPS). Upper right: the number of splits needed to keep a given mantissa space length (representation accuracy), where the mantissa space length is $\text{BPS} \times \text{num\_split}$. Bottom left: the working memory needed to store the slices for a given mantissa space length. Bottom right: the number of matrix multiplications at line 6 of Algorithms \ref{alg:ozaki-fp} and \ref{alg:ozaki-int}.
  • Figure 5: Unit throughput comparison between the INT8 Tensor Core (INT8-INT32) and the FP16 Tensor Core (FP16-FP32).
  • ...and 5 more figures
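The quantities named in the Figure 4 caption can be related with a few lines of arithmetic. The sketch below is an assumption-laden illustration, not the paper's exact formulas: it takes BPS as the smaller of the storage limit of the input type and a sufficient no-overflow bound for accumulating k products, then derives the number of splits from the caption's relation (mantissa space length = BPS × num_split). The function names, the 7-bit and 11-bit storage limits, the 31-bit and 24-bit accumulator limits, and the GEMM count (an upper bound that multiplies every slice pair) are all choices made for this example.

```python
import math

def bits_per_slice(storage_bits, acc_bits, k):
    """Assumed sufficient bound: k products of two bps-bit values need
    2*bps + ceil(log2(k)) <= acc_bits to accumulate without overflow or rounding."""
    bps = min(storage_bits, (acc_bits - math.ceil(math.log2(k))) // 2)
    assert bps > 0, "inner dimension k is too large for this accumulator"
    return bps

def slice_plan(storage_bits, acc_bits, bytes_per_value, k, target_bits=53):
    """Relate BPS, split count, slice memory, and GEMM count for one unit type."""
    bps = bits_per_slice(storage_bits, acc_bits, k)
    num_split = -(-target_bits // bps)          # ceil(target_bits / bps)
    return {
        "bps": bps,
        "num_split": num_split,
        "bytes_per_element": num_split * bytes_per_value,
        "max_gemms": num_split ** 2,            # upper bound: every slice pair
    }

if __name__ == "__main__":
    k = 4096
    # INT8 input (7 magnitude bits), INT32 accumulator (31 magnitude bits).
    print("INT8-INT32:", slice_plan(7, 31, 1, k))
    # FP16 input (11-bit significand), FP32 accumulator (24-bit significand).
    print("FP16-FP32:", slice_plan(11, 24, 2, k))
```

Under these assumptions and k = 4096, the integer path keeps 7 bits per slice against 6 for FP16, which is the direction of the advantages listed in the TL;DR (fewer splits, smaller slice memory, fewer GEMMs). The actual curves in Figure 4 should be taken from the paper, not from this sketch.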