Emulation of Complex Matrix Multiplication based on the Chinese Remainder Theorem

Yuki Uchino, Qianxiang Ma, Toshiyuki Imamura, Katsuhisa Ozaki, Patrick Lars Gutsche

TL;DR

This work extends the Ozaki-II CRT-based emulation framework to complex matrix multiplication (CGEMM/ZGEMM) on INT8 engines, leveraging Karatsuba-based formulations and two scaling-vector strategies to achieve high throughput with competitive accuracy. It introduces a portable GEMM emulation library (CUDA/HIP) and provides performance models to predict runtime, demonstrating substantial speedups over vendor GEMM implementations on several GPUs, notably the B200 and RTX 5080, while also analyzing limitations such as memory overhead. The results indicate that, for many large-scale problems, the proposed approach can serve as a default emulation algorithm where high precision is desired but hardware is constrained to low-precision units. The work also contrasts single- and double-precision complex emulation with real-valued variants, offering insights into architectural balance and scalability across diverse accelerators.

Abstract

Modern computing architectures feature low-precision matrix multiplication units that achieve substantially higher throughput than their high-precision counterparts. Motivated by this architectural trend, the emulation of high-precision matrix multiplication using low-precision hardware has attracted significant interest in the high-performance computing community. Ozaki, Uchino, and Imamura proposed the Ozaki-II scheme as a general framework for emulating matrix multiplication. Building on this framework, Uchino, Ozaki, and Imamura developed high-performance and power-efficient techniques for emulating single- and double-precision real matrix multiplication on INT8 matrix engines. Extending this line of research, the present study proposes high-performance emulation methods for single- and double-precision complex matrix multiplication on INT8 matrix engines, based on the Ozaki-II scheme. On an NVIDIA B200 GPU, the proposed methods achieve 4.4–6.5x and 4.0–5.6x speedups over the native single- and double-precision complex matrix multiplication routines from cuBLAS, respectively, for sufficiently large problem sizes. When lower accuracy than that of the standard routines is acceptable, the proposed methods can operate at even higher speed. Conversely, with only a modest increase in computation time, they can deliver higher accuracy than that of the standard routines. These properties suggest that the proposed approach has the potential to serve as a default algorithm across a wide range of applications.
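The idea behind a CRT-based emulation can be illustrated on plain integers. The sketch below is not the authors' implementation: it assumes the inputs are already integer matrices (the paper's scaling of floating-point inputs is not described in this section), and the moduli, the helper names `sym_mod` and `crt_matmul`, and the pure-Python `matmul` are all illustrative choices. Each modulus yields one product of small residues, and the Chinese Remainder Theorem recombines those products into the exact result, provided the true entries are smaller in magnitude than half the product of the moduli.

```python
from math import prod


def matmul(a, b):
    """Plain integer matrix product of list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def sym_mod(x, p):
    """Residue of x modulo p, shifted into the symmetric range around zero."""
    r = x % p
    return r - p if r > p // 2 else r


def crt_matmul(a, b, moduli):
    """Recover an integer matrix product from one small-residue product per modulus.

    Assumes the moduli are pairwise coprime and that every entry of the true
    product is smaller in magnitude than half the product of the moduli.
    """
    big_p = prod(moduli)
    rows, cols = len(a), len(b[0])
    acc = [[0] * cols for _ in range(rows)]
    for p in moduli:
        # For odd p <= 255 every reduced entry lies in [-127, 127],
        # so it would fit in a signed 8-bit integer.
        a_p = [[sym_mod(x, p) for x in row] for row in a]
        b_p = [[sym_mod(x, p) for x in row] for row in b]
        c_p = matmul(a_p, b_p)  # stand-in for a low-precision GEMM call
        # CRT weight: (P / p) times its modular inverse modulo p.
        q = big_p // p
        w = q * pow(q, -1, p)
        for i in range(rows):
            for j in range(cols):
                acc[i][j] += w * (c_p[i][j] % p)
    return [[sym_mod(x, big_p) for x in row] for row in acc]


# Illustrative pairwise-coprime odd moduli (an assumption, not the paper's set).
MODULI = [247, 251, 253, 255]
```

With these four moduli the recoverable range is roughly plus or minus 2e9, so `crt_matmul(a, b, MODULI)` matches `matmul(a, b)` for small test matrices. Covering a wider range means adding moduli, at the cost of one extra low-precision product each.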

Paper Structure

This paper contains 16 sections, 27 equations, 15 figures, 1 table, 1 algorithm.

Figures (15)

  • Figure 1: Performance comparison of four INT8-based GEMM strategies on NVIDIA H100 NVL (CUDA Toolkit 12.8.61). The first two methods use a single INT8 matrix multiplication with dimensions $(2h, h, 2h)$ and $(h, 2h, 2h)$, respectively. The third method performs three INT8 multiplications of size $(h, h, h)$. The fourth method applies the same three-multiplication scheme but introduces blocking along the $n$ dimension.
  • Figure 2: Performance model heatmaps for single-precision complex matrix multiplication emulation. The left panel corresponds to the fast mode and the right panel corresponds to the accurate mode. The horizontal axis denotes the achievable memory bandwidth and the vertical axis denotes the achievable INT8 GEMM throughput. The problem size is fixed at $m=n=k=16384$ and the correction term is set to $c=6$, equal to the number of moduli used in the emulation. The color scale indicates the predicted throughput (in TFLOPS) of the proposed emulation.
  • Figure 3: Performance model heatmaps for double-precision complex matrix multiplication emulation. The left panel corresponds to the fast mode and the right panel corresponds to the accurate mode. The horizontal axis denotes the achievable memory bandwidth and the vertical axis denotes the achievable INT8 GEMM throughput. The problem size is fixed at $m=n=k=16384$ and the correction term is set to $c=13$, equal to the number of moduli used in the emulation. The color scale indicates the predicted throughput (in TFLOPS) of the proposed emulation.
  • Figure 4: Maximum relative error of single-precision complex matrix multiplication on NVIDIA GH200 Grace Hopper Superchip (CUDA Toolkit 13.0.88, gcc 11.5.0) for $m=n=1024$ and $k=16384$.
  • Figure 5: Maximum relative error of double-precision complex matrix multiplication on NVIDIA GH200 Grace Hopper Superchip (CUDA Toolkit 13.0.88, gcc 11.5.0) for $m=n=1024$ and $k=16384$.
  • ...and 10 more figures
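Figure 1 compares ways of mapping one complex product onto real multiplications. The following is a minimal sketch of two of them, under assumptions that this section does not confirm: the shape triples are read as $(m, n, k)$, the single-multiplication variant is taken to be the standard real block embedding, and the three-multiplication variant is taken to be the Karatsuba identity named in the TL;DR. The function names are hypothetical, and ordinary Python integers stand in for INT8 data.

```python
def matmul(a, b):
    """Plain matrix product of list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    """Elementwise sum of two matrices of equal shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def sub(a, b):
    """Elementwise difference of two matrices of equal shape."""
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def complex_matmul_karatsuba(ar, ai, br, bi):
    """Three real (h, h, h) products instead of the naive four."""
    p1 = matmul(ar, br)
    p2 = matmul(ai, bi)
    p3 = matmul(add(ar, ai), add(br, bi))
    cr = sub(p1, p2)            # Re(C) = Ar Br - Ai Bi
    ci = sub(sub(p3, p1), p2)   # Im(C) = Ar Bi + Ai Br
    return cr, ci


def complex_matmul_embedded(ar, ai, br, bi):
    """One real product of shape (2h, h, 2h): [[Ar, -Ai], [Ai, Ar]] @ [[Br], [Bi]]."""
    h = len(ar)
    top = [row_r + [-x for x in row_i] for row_r, row_i in zip(ar, ai)]
    bottom = [row_i + row_r for row_r, row_i in zip(ar, ai)]
    stacked = br + bi  # Br stacked on top of Bi, giving a 2h x h matrix
    c = matmul(top + bottom, stacked)
    return c[:h], c[h:]  # first h rows are Re(C), last h rows are Im(C)
```

Both functions return the real and imaginary parts of the same product. The embedded form performs $4h^3$ multiply-adds in a single call, whereas the Karatsuba form performs $3h^3$ across three calls plus a few matrix additions.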