High-Performance and Power-Efficient Emulation of Matrix Multiplication using INT8 Matrix Engines

Yuki Uchino, Katsuhisa Ozaki, Toshiyuki Imamura

TL;DR

This paper tackles the challenge of achieving high-performance, high-precision matrix multiplication on architectures dominated by low-precision matrix engines. It introduces Ozaki Scheme II, an emulation method based on the Chinese Remainder Theorem (CRT) that avoids input splitting and uses INT8 matrix engines to compute $C\approx AB$ with controlled accuracy, through careful constant selection, INT8 conversion, and a two-stage accumulation scheme. The approach yields substantial throughput and power-efficiency gains over native GEMMs and prior emulation techniques, particularly for large matrices on modern accelerators such as the GH200, A100, and RTX 5080, and provides intermediate precision between FP32 and TF32 suitable for non-AI numerical tasks. The work also outlines accurate, scalable extensions to various floating-point formats, offering a practical pathway to bridge AI-optimized hardware with precision-critical computations. Overall, Ozaki Scheme II demonstrates that CRT-based emulation can significantly accelerate high-precision GEMMs while improving energy efficiency, with broad implications for numerical linear algebra on heterogeneous hardware.

Abstract

Recent architectures integrate high-performance and power-efficient matrix engines. These engines demonstrate remarkable performance in low-precision matrix multiplication, which is crucial in deep learning. Several techniques have been proposed to emulate single- and double-precision general matrix-matrix multiplication (SGEMM and DGEMM, respectively) by leveraging such low-precision matrix engines. In this study, we present emulation methods that significantly outperform conventional approaches. On a GH200 Grace Hopper Superchip, the proposed DGEMM emulation achieves a 1.4x speedup and a 43% improvement in power efficiency compared to native DGEMM for sufficiently large problems. The proposed SGEMM emulation achieves a 3.0x speedup and a 154% improvement in power efficiency compared to native SGEMM for sufficiently large problems. Furthermore, compared to conventional emulation methods, the proposed emulation achieves more than 2x higher performance and superior power efficiency.
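
The core idea of emulating one high-precision product with several small modular products, recombined through the Chinese Remainder Theorem, can be sketched for integer matrices as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the moduli in `MODULI` and the helper names are hypothetical, the inputs are assumed to be integer matrices already, and the conversion from floating point, the constant selection, and the two-stage accumulation are all omitted.

```python
from math import prod

# Hypothetical moduli: pairwise coprime and at most 256, so that
# symmetric residues fit in the INT8 range [-128, 127].
MODULI = [251, 253, 255, 256]


def symmetric_mod(x, p):
    """Residue of x modulo p, shifted into [-(p // 2), p - 1 - p // 2]."""
    return (x + p // 2) % p - p // 2


def matmul(A, B):
    """Plain integer matrix product on nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def emulated_matmul(A, B, moduli=MODULI):
    """Rebuild the integer product A @ B from one small-residue product per modulus.

    The result is exact only if every entry of A @ B lies in [-P/2, P/2),
    where P is the product of the moduli.
    """
    P = prod(moduli)
    rows, cols = len(A), len(B[0])
    acc = [[0] * cols for _ in range(rows)]
    for p in moduli:
        # Step 1: reduce the inputs to INT8-sized residues.
        A_p = [[symmetric_mod(a, p) for a in row] for row in A]
        B_p = [[symmetric_mod(b, p) for b in row] for row in B]
        assert all(-128 <= v <= 127 for M in (A_p, B_p) for row in M for v in row)
        # Step 2: the product that an INT8 matrix engine would compute
        # (done here with ordinary Python integers).
        C_p = matmul(A_p, B_p)
        # Step 3: add the CRT term (P / p) * q * (C_p mod p), with q the
        # modular inverse of P / p modulo p (pow with exponent -1, Python 3.8+).
        weight = (P // p) * pow(P // p, -1, p)
        for i in range(rows):
            for j in range(cols):
                acc[i][j] += weight * (C_p[i][j] % p)
    # Step 4: reduce modulo P into the symmetric range to recover signed entries.
    return [[symmetric_mod(v, P) for v in row] for row in acc]


A = [[3, -7, 120000], [45, 6, -9]]
B = [[1, 2], [-3, 4], [5, -600]]
assert emulated_matmul(A, B) == matmul(A, B)
```

Each modulus contributes one independent small-integer product, so the number of moduli trades accuracy range against work: more moduli give a larger $\mathcal{P}$ and therefore a wider range of exactly representable results.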

Paper Structure

This paper contains 13 sections, 1 theorem, 19 equations, 9 figures, 1 algorithm.

Key Result

Theorem 1

Let $x \in \mathbb{Z}$. Suppose that $p_1,\dots,p_N \in \mathbb{N}_{\ge 2}$ are pairwise coprime integers and $\mathcal{P} := \prod_{1 \le i \le N}{p_i}$. For $i=1,\dots,N$, define $q_i \in \mathbb{N}$ as modular multiplicative inverses of $\mathcal{P}/p_i$ (i.e., $\mathcal{P}/p_i \cdot q_i \equiv 1 \pmod{p_i}$). Then, it holds that $x \equiv \sum_{1 \le i \le N} \mathcal{P}/p_i \cdot q_i \cdot (x \bmod p_i) \pmod{\mathcal{P}}$.
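
The reconstruction that the Chinese Remainder Theorem guarantees, recovering $x$ modulo $\mathcal{P}$ from its residues modulo each $p_i$ with the weights $\mathcal{P}/p_i \cdot q_i$, can be checked numerically. The sketch below is illustrative only: the function name `crt_reconstruct` and the example moduli are assumptions, not taken from the paper.

```python
from math import prod


def crt_reconstruct(residues, moduli):
    """Recover x mod P from residues y_i = x mod p_i, for pairwise coprime p_i."""
    P = prod(moduli)
    total = 0
    for y_i, p_i in zip(residues, moduli):
        P_i = P // p_i            # P / p_i
        q_i = pow(P_i, -1, p_i)   # modular inverse of P / p_i modulo p_i (Python 3.8+)
        total += P_i * q_i * y_i
    return total % P


# Hypothetical pairwise coprime moduli: 251 (prime), 253 = 11 * 23,
# 255 = 3 * 5 * 17, 256 = 2**8.
moduli = [251, 253, 255, 256]
x = 123456789                     # smaller than the product of the moduli
residues = [x % p for p in moduli]
assert crt_reconstruct(residues, moduli) == x
```

The value is recovered exactly only when $0 \le x < \mathcal{P}$; outside that range the function returns the representative of $x$ modulo $\mathcal{P}$.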

Figures (9)

  • Figure 1: TFLOPS and TOPS of AMD and NVIDIA GPUs for dense data
  • Figure 2: Image of $s_{i1}$ and $s_{i2}$
  • Figure 3: Accuracy of DGEMM (top) and SGEMM (bottom) emulation for $m=n=1024$ on GH200. Solid lines represent results for $k=1024$, and dashed lines for $k=16384$.
  • Figure 4: Throughput performance of DGEMM emulation on A100 (top), GH200 (middle), and RTX 5080 (bottom)
  • Figure 5: Throughput performance of SGEMM emulation on A100 (top), GH200 (middle), and RTX 5080 (bottom)
  • ...and 4 more figures

Theorems & Definitions (1)

  • Theorem 1: Chinese Remainder Theorem