High-Performance and Power-Efficient Emulation of Matrix Multiplication using INT8 Matrix Engines
Yuki Uchino, Katsuhisa Ozaki, Toshiyuki Imamura
TL;DR
This paper addresses high-precision matrix multiplication on architectures whose peak throughput comes from low-precision matrix engines. It introduces Ozaki Scheme II, an emulation method based on the Chinese Remainder Theorem (CRT) that avoids input splitting and uses INT8 matrix engines to compute $C\approx AB$ with controlled accuracy. The method relies on careful constant selection, INT8 conversion, and a two-stage accumulation scheme. It yields substantial throughput and power-efficiency gains over native GEMMs and prior emulation techniques, particularly for large matrices on modern accelerators such as the GH200, A100, and RTX 5080, and it provides intermediate precision between FP32 and TF32 that is suitable for non-AI numerical tasks. The work also outlines accurate, scalable extensions to various floating-point formats, offering a practical way to use AI-optimized hardware for precision-critical computations. Overall, Ozaki Scheme II shows that CRT-based emulation can accelerate high-precision GEMMs while improving energy efficiency, with implications for numerical linear algebra on heterogeneous hardware.
Abstract
Recent architectures integrate high-performance and power-efficient matrix engines. These engines demonstrate remarkable performance in low-precision matrix multiplication, which is crucial in deep learning. Several techniques have been proposed to emulate single- and double-precision general matrix-matrix multiplication (SGEMM and DGEMM, respectively) by leveraging such low-precision matrix engines. In this study, we present emulation methods that significantly outperform conventional approaches. On a GH200 Grace Hopper Superchip, the proposed DGEMM emulation achieves a 1.4x speedup and a 43% improvement in power efficiency compared to native DGEMM for sufficiently large problems. The proposed SGEMM emulation achieves a 3.0x speedup and a 154% improvement in power efficiency compared to native SGEMM for sufficiently large problems. Furthermore, compared to conventional emulation methods, the proposed emulation achieves more than 2x higher performance and superior power efficiency.
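To make the CRT idea mentioned in the TL;DR concrete, the following is a minimal NumPy sketch of exact integer matrix multiplication via residues that fit in INT8. It is not the authors' implementation: the moduli, the helper names (`symmetric_mod`, `crt_matmul`), and the restriction to integer inputs are assumptions made for illustration. The sketch omits the conversion of floating-point inputs to integers, the constant selection, and the two-stage accumulation that the paper describes, and it runs on the CPU rather than on an INT8 matrix engine.

```python
import numpy as np
from math import prod

# Hypothetical choice of pairwise coprime moduli, each at most 256 so that
# symmetric residues fit in the INT8 range [-128, 127].
MODULI = (256, 255, 253, 251, 247)


def symmetric_mod(x, m):
    """Reduce x modulo m into the symmetric range [-(m // 2), m - 1 - m // 2]."""
    half = m // 2
    return (x + half) % m - half


def crt_matmul(A, B, moduli=MODULI):
    """Exact product of integer matrices A and B using INT8 residues and the CRT.

    The result is exact only if every entry of A @ B has absolute value
    below prod(moduli) / 2.
    """
    A = np.asarray(A, dtype=np.int64)
    B = np.asarray(B, dtype=np.int64)
    M = prod(moduli)
    acc = np.zeros((A.shape[0], B.shape[1]), dtype=np.int64)
    for m in moduli:
        # INT8 residues of the inputs.
        A8 = symmetric_mod(A, m).astype(np.int8)
        B8 = symmetric_mod(B, m).astype(np.int8)
        # Stand-in for an INT8 matrix engine: INT8 inputs, INT32 accumulation.
        R = A8.astype(np.int32) @ B8.astype(np.int32)
        r = symmetric_mod(R.astype(np.int64), m)
        # CRT weight for this modulus: (M / m) * inverse of (M / m) modulo m.
        Mi = M // m
        w = (Mi * pow(Mi, -1, m)) % M
        acc += r * w
    # Map the combined value back to the symmetric range around zero.
    return symmetric_mod(acc, M)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.integers(-1000, 1001, size=(4, 32))
    B = rng.integers(-1000, 1001, size=(32, 3))
    print(np.array_equal(crt_matmul(A, B), A @ B))
```

Each modulus contributes one independent INT8 product, so the number of low-precision multiplications grows with the number of moduli rather than with the number of slices of the inputs. With the five moduli assumed here, the product of the moduli is about $1.02 \times 10^{12}$, which bounds the magnitude of the entries that can be recovered exactly.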
