Guaranteed DGEMM Accuracy While Using Reduced Precision Tensor Cores Through Extensions of the Ozaki Scheme

Angelika Schwarz, Anton Anders, Cole Brower, Harun Bayraktar, John Gunnels, Kate Clark, RuQing G. Xu, Samuel Rodriguez, Sebastien Cayrols, Paweł Tabaszewski, Victor Podlozhnyuk

TL;DR

The paper addresses the need for reliable FP64 accuracy on low-precision Tensor Cores by introducing Automatic Dynamic Precision (ADP), a GPU-resident framework built on an Exponent Span Capacity (ESC) estimator. It also introduces unsigned slice encoding to improve mantissa utilization and demonstrates that ADP can guarantee FP64-level accuracy with less than 10% runtime overhead, achieving up to 2.3x and 13.2x speedups over native FP64 GEMM on NVIDIA Blackwell GB200 and the RTX Pro 6000 Blackwell Server Edition, respectively, and enabling production-grade use via cuBLAS/cuSOLVER integration. The core contributions are the ESC estimator, the ADP workflow with safety guardrails and fallback to native FP64, and the open-source release for reproducibility and adoption. The results show that low-precision accelerators can be harnessed for high-fidelity scientific computing workloads without sacrificing performance, paving the way for safer integration of FP64 emulation in production HPC libraries.

Abstract

The rapid growth of artificial intelligence (AI) has made low-precision formats such as FP16, FP8, and, most recently, block-scaled FP4 the primary focus of modern GPUs, where Tensor Cores now deliver orders-of-magnitude higher throughput than traditional FP64 pipelines. This hardware shift has sparked a new line of algorithm research: using low-precision units to emulate double-precision accuracy through schemes such as Ozaki decompositions. We advance this direction with Automatic Dynamic Precision (ADP), a fully GPU-resident framework that makes emulated FP64 matrix multiplication both efficient and reliable. At its core is the Exponent Span Capacity (ESC), a hardware-agnostic estimator that conservatively determines the decomposition parameter (the number of slices) required to achieve FP64-level accuracy. Built on ESC, ADP integrates exception handling, runtime heuristics, and seamless fallback to native FP64, ensuring correctness without host-device synchronization or user intervention. We further improve Ozaki-style decompositions with an unsigned integer slicing scheme, which increases representational efficiency and reduces computational waste. Validated against recently proposed BLAS grading tests, ADP consistently preserves FP64 fidelity on challenging inputs while incurring less than 10% runtime overhead. In a 55-bit mantissa setting, our approach achieves up to 2.3x and 13.2x speedups over native FP64 GEMM on NVIDIA Blackwell GB200 and the RTX Pro 6000 Blackwell Server Edition, respectively. Our results demonstrate that low-precision accelerators can serve as a practical, production-ready foundation for high-fidelity and high-performance scientific computing workloads.
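To make the role of the slice count concrete, the following is a minimal pure-Python sketch of an Ozaki-style sliced dot product. It is an illustration under stated assumptions, not the paper's implementation: the names `slice_vector`, `emulated_dot`, and `BITS_PER_SLICE` are hypothetical, each vector is assumed to share a single exponent, slices use a simple sign-magnitude split with 7 magnitude bits, and all slice pairs are multiplied. It shows why inputs with a wide exponent span need more slices, which is the quantity an estimator such as ESC has to bound.

```python
import math
from fractions import Fraction

BITS_PER_SLICE = 7  # assumption: 7 magnitude bits, so every slice fits a signed 8-bit integer


def slice_vector(v, num_slices):
    """Split a vector of finite floats into integer slices under one shared exponent."""
    # Shared exponent e such that |x| < 2**e for every element of v.
    exponent = max((math.frexp(x)[1] for x in v if x != 0.0), default=0)
    total_bits = num_slices * BITS_PER_SLICE
    scale = Fraction(2) ** (total_bits - exponent)
    # Fixed-point integers, truncated toward zero; small elements lose low bits here.
    fixed = [int(Fraction(x) * scale) for x in v]
    mask = (1 << BITS_PER_SLICE) - 1
    slices = []
    for k in range(num_slices):  # k = 0 is the most significant slice
        shift = BITS_PER_SLICE * (num_slices - 1 - k)
        slices.append([(1 if q >= 0 else -1) * ((abs(q) >> shift) & mask) for q in fixed])
    return exponent, slices


def emulated_dot(x, y, num_slices):
    """Dot product built only from small-integer slice products with exact accumulation."""
    ex, xs = slice_vector(x, num_slices)
    ey, ys = slice_vector(y, num_slices)
    total = 0
    for k, xk in enumerate(xs):
        for l, yl in enumerate(ys):
            partial = sum(a * b for a, b in zip(xk, yl))  # exact integer dot product
            weight = BITS_PER_SLICE * (2 * (num_slices - 1) - k - l)
            total += partial << weight
    return math.ldexp(float(total), ex + ey - 2 * num_slices * BITS_PER_SLICE)


# Narrow exponent span: 8 slices (56 bits) reproduce the exact result.
print(emulated_dot([1.5, -2.25, 3.0], [4.0, 0.5, -1.0], num_slices=8))  # 1.875
# Wide exponent span: 2 slices drop the small term, 8 slices keep it.
print(emulated_dot([1.0, 2.0**-30], [1.0, 1.0], num_slices=2) == 1.0)
print(emulated_dot([1.0, 2.0**-30], [1.0, 1.0], num_slices=8) == 1.0 + 2.0**-30)
```

In this sketch the element `2.0**-30` sits 30 bits below the shared exponent, so a 14-bit fixed-point budget truncates it to zero while a 56-bit budget retains it. A production scheme would run the slice products on integer Tensor Cores and choose the slice count from the data rather than by hand.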

Paper Structure

This paper contains 25 sections, 8 equations, 8 figures, 1 algorithm.

Figures (8)

  • Figure 1: Unsigned slice encoding using two's complement arithmetic. Case 1: u8 values in [0,127] map directly to s8 without modification. Case 2: u8 values in [128,255] are remapped as $256-x$ with a $+256$ carry to the higher slice, while storing $-x$ in s8. Bit patterns are preserved, e.g., 200 (u8) $\equiv$ -56 (s8) = 0b11001000.
  • Figure 2: ADP-enabled DGEMM, configured with six distinct mantissa bit counts for the Ozaki-I algorithm, on Test 2, where $n = 1024$. For each mantissa bit count, we display the variant without the option to fall back to native FP64 DGEMM (solid lines) and the variant with guardrails and automatic fallback to native FP64 DGEMM (dashed lines).
  • Figure 3: Maximum componentwise relative error when multiplying two random uniformly distributed matrices.
  • Figure 4: Average componentwise relative error when multiplying two random uniformly distributed matrices.
  • Figure 5: Breakdown of DGEMM performance when emulating 55 mantissa bits on NVIDIA Tensor Cores. For this experiment, ADP is forced to always use 55 bits, regardless of input characteristics, in order to maximize its relative overhead (worst-case configuration).
  • ...and 3 more figures
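
The unsigned-to-signed remapping described in the Figure 1 caption can be sketched in a few lines of Python. This is one reading of the caption rather than the authors' code: the helper names `u8_digits`, `to_s8_slices`, and `reconstruct` are hypothetical, and the carry handling (adding an incoming carry before the range check) is an assumption made so that the sketch works for every input.

```python
def u8_digits(value, num_slices):
    """Split a non-negative integer into base-256 digits, least significant first."""
    if value < 0 or value >= 256 ** num_slices:
        raise ValueError("value does not fit in the requested number of slices")
    return [(value >> (8 * i)) & 0xFF for i in range(num_slices)]


def to_s8_slices(digits):
    """Re-encode u8 digits as s8 digits, pushing a carry into the next higher slice."""
    slices = []
    carry = 0
    for d in digits:
        d += carry  # d is now in [0, 256]
        if d >= 128:
            # Case 2: store d - 256 (a value in [-128, 0]) and carry one unit,
            # worth +256 at this position, into the next higher slice.
            slices.append(d - 256)
            carry = 1
        else:
            # Case 1: values in [0, 127] are stored unchanged.
            slices.append(d)
            carry = 0
    return slices, carry  # a final carry of 1 would need one extra slice


def reconstruct(slices, carry):
    """Recombine s8 slices and the final carry into the original integer."""
    return sum(s * 256 ** i for i, s in enumerate(slices)) + carry * 256 ** len(slices)


slices, carry = to_s8_slices(u8_digits(200, 1))
print(slices, carry)              # [-56] 1
print(bin(slices[0] & 0xFF))      # 0b11001000, the same bit pattern as 200 (u8)
print(reconstruct(slices, carry)) # 200
```

When no carry arrives from a lower slice, the stored s8 value has the same bit pattern as the original u8 digit, which matches the caption's example of 200 (u8) and -56 (s8) both being 0b11001000.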