Performance Enhancement of the Ozaki Scheme on Integer Matrix Multiplication Unit

Yuki Uchino, Katsuhisa Ozaki, Toshiyuki Imamura

TL;DR

The paper tackles efficient high-precision matrix multiplication on architectures with low-precision tensor cores by optimizing the Ozaki scheme implemented in ozIMMU. It introduces three improvements: (i) splitting with rounding-to-nearest to reduce truncation error and potentially decrease the required number of slices, (ii) group-wise error-free accumulation to cut the costly FP64 accumulation, and (iii) a combined approach that integrates both strategies for better accuracy and speed. Through numerical experiments on RTX 4090 and GH200, the proposed methods achieve substantial reductions in accumulation overhead and competitive or superior accuracy across varying slice counts, yielding 1.2–1.6x speedups over the original ozIMMU. The work supports the potential for more efficient mixed-precision GEMM on next-generation architectures by reducing expensive high-precision steps while maintaining robustness against rounding errors. The results have practical implications for high-performance computing and ML workloads that rely on efficient, accurate matrix multiplications using Tensor Cores.

Abstract

This study was aimed at simultaneously achieving sufficient accuracy and high performance for general matrix multiplications. Recent architectures, such as NVIDIA GPUs, feature high-performance units designed for low-precision matrix multiplications in machine learning models, and next-generation architectures are expected to follow the same design principle. The key to achieving superior performance is to fully leverage such architectures. The Ozaki scheme, a highly accurate matrix multiplication algorithm using error-free transformations, enables higher-precision matrix multiplication to be performed through multiple lower-precision matrix multiplications and higher-precision matrix additions. Ootomo et al. implemented the Ozaki scheme on high-performance matrix multiplication units with the aim of achieving both sufficient accuracy and high performance. This paper proposes alternative approaches to improving performance by reducing the numbers of lower-precision matrix multiplications and higher-precision matrix additions. Numerical experiments demonstrate the accuracy of the results, and performance benchmarks evaluate the speed of the proposed approaches. These approaches are expected to yield more efficient results in next-generation architectures.
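The abstract describes the Ozaki scheme only in words, so a small sketch may help make the mechanics concrete. The code below is a simplified illustration in pure Python, not the ozIMMU implementation or the authors' algorithms: the names `split_rows`, `int_matmul_abt`, `ozaki_matmul`, and the parameters `k` (number of slices) and `beta` (bits per slice) are hypothetical, Python integers stand in for the integer matrix multiplication unit, and the splitting uses truncation (the rounding-to-nearest variant from the TL;DR is not shown). Summing all products with the same slice-index sum as integers before converting to floating point is one reading of the group-wise accumulation idea.

```python
import math

def split_rows(M, k, beta):
    """Split each row of M into k integer slices of beta bits each (truncation).

    Returns (exps, slices) such that, approximately,
    M[i][j] = 2**exps[i] * sum_s slices[s][i][j] * 2**(-beta * (s + 1)).
    """
    exps = []
    slices = [[[0] * len(row) for row in M] for _ in range(k)]
    for i, row in enumerate(M):
        amax = max(abs(x) for x in row)
        e = math.frexp(amax)[1] if amax > 0 else 0  # amax < 2**e
        exps.append(e)
        for j, x in enumerate(row):
            r = math.ldexp(x, -e)          # scaled value, |r| < 1
            for s in range(k):
                r = math.ldexp(r, beta)    # shift the next beta bits up
                q = math.trunc(r)          # integer slice, |q| < 2**beta
                slices[s][i][j] = q
                r -= q                     # remaining fraction, |r| < 1
    return exps, slices

def int_matmul_abt(X, Y):
    """Exact integer product X @ Y^T (Python ints do not overflow)."""
    return [[sum(x * y for x, y in zip(xr, yr)) for yr in Y] for xr in X]

def ozaki_matmul(A, B, k=8, beta=6):
    """Approximate A @ B from k slices per matrix (assumed parameters)."""
    Bt = [list(col) for col in zip(*B)]    # split B column-wise via its transpose
    ea, As = split_rows(A, k, beta)
    eb, Bs = split_rows(Bt, k, beta)
    m, n = len(A), len(Bt)
    C = [[0.0] * n for _ in range(m)]
    # Group g collects the products A_s @ B_t with s + t == g (0-based).
    # Only the k leading groups are kept; add the smallest contributions first.
    for g in reversed(range(k)):
        acc = [[0] * n for _ in range(m)]
        for s in range(g + 1):
            P = int_matmul_abt(As[s], Bs[g - s])
            for i in range(m):
                for j in range(n):
                    acc[i][j] += P[i][j]   # exact integer accumulation within the group
        for i in range(m):
            for j in range(n):
                # One floating-point conversion and addition per group
                shift = ea[i] + eb[j] - beta * (g + 2)
                C[i][j] += math.ldexp(float(acc[i][j]), shift)
    return C
```

With `k` slices this sketch performs `k * (k + 1) / 2` integer products but only `k` floating-point accumulations, which illustrates why reducing either the slice count or the number of high-precision additions lowers the total cost.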

Paper Structure

This paper contains 15 sections, 65 equations, 16 figures, 1 table, 8 algorithms.

Figures (16)

  • Figure 1: Accuracy of ozIMMU. Matrix $A$ has entries $a_{ij} := (U_{ij}-0.5)\cdot \exp(\phi\cdot N_{ij})$, where $U_{ij} \in (0,1)$ are uniformly distributed random numbers and $N_{ij}$ are drawn from the standard normal distribution, for $1 \le i,j \le m$ and $m=n=p$. Matrix $B$ is composed similarly.
  • Figure 2: Time breakdown of ozIMMU on NVIDIA GeForce RTX 4090
  • Figure 3: Time breakdown of ozIMMU on NVIDIA GH200 Grace Hopper Superchip
  • Figure 4: Images of matrix multiplications in Ozaki scheme for $k=4$
  • Figure 5: Comparison of accuracy between ozIMMU and proposed methods. Matrix $A$ has entries $a_{ij} := (U_{ij}-0.5)\cdot \exp(\phi\cdot N_{ij})$, where $U_{ij} \in (0,1)$ are uniformly distributed random numbers and $N_{ij}$ are drawn from the standard normal distribution, for $1 \le i,j \le m$ and $m=n=p$. Matrix $B$ has a similar composition.
  • ...and 11 more figures
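
The test matrices described in the captions of Figures 1 and 5 follow a simple formula, which the sketch below reproduces in pure Python. The function name `make_test_matrix` and its arguments are hypothetical and not taken from the paper. One small difference is worth noting: `random.random()` samples from $[0,1)$ rather than the open interval $(0,1)$ given in the captions. Larger values of `phi` widen the spread of entry magnitudes.

```python
import math
import random

def make_test_matrix(m, n, phi, rng):
    """Return an m-by-n list of lists with entries (U - 0.5) * exp(phi * N).

    U is uniform on [0, 1) and N is standard normal; rng is a random.Random.
    """
    return [
        [(rng.random() - 0.5) * math.exp(phi * rng.gauss(0.0, 1.0)) for _ in range(n)]
        for _ in range(m)
    ]

# Example: a 4-by-4 matrix with a moderate spread of magnitudes
rng = random.Random(42)
A = make_test_matrix(4, 4, 1.0, rng)
```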