Accurate Models of NVIDIA Tensor Cores

Faizan A. Khattak, Mantas Mikaitis

TL;DR

This work tackles the lack of reproducible numerical behavior in mixed-precision NVIDIA tensor cores by developing MATLAB-based, hardware-validated models across V100, A100, A2/A30, L40S, H100/H200, and B200 GPUs. It introduces Generalised Numerical Feature Testing (GNFT) and Input Space Search Method (ISSM) to identify and refine the inner-product behavior, yielding accurate, configurable tensor-core models validated against hardware with randomized testing. The resulting MATLAB Tensor Core v0.2 toolbox enables researchers to reproduce, analyze, and experiment with tensor-core arithmetic, including a customizable model for user-defined variants and multi-word high-precision GEMM emulation. The work has practical impact for numerical analysis, cross-platform comparisons, and standardization efforts by providing transparent, testable models that align with hardware specifics.

Abstract

Matrix multiplication is a fundamental operation for both the training of neural networks and inference. To accelerate it, Graphics Processing Units (GPUs) implement matrix multiplication in hardware. Because these hardware multipliers offer higher throughput than software-based matrix multiplication, they are increasingly used outside of AI to accelerate various applications in scientific computing. However, matrix multipliers targeted at AI are at present not compliant with IEEE 754 floating-point arithmetic behaviour, with different vendors offering different numerical features. This leads to non-reproducible results across different generations of GPU architectures at the matrix multiply-accumulate instruction level. To study the numerical characteristics of matrix multipliers (such as rounding behaviour, accumulator width, normalization points, extra carry bits, and others), test vectors are typically constructed. Yet these vectors may or may not distinguish between different hardware models, and due to limited hardware availability, their reliability across many different platforms remains largely untested. We present software models for emulating the inner product behavior of low- and mixed-precision matrix multipliers in the V100, A100, H100 and B200 data center GPUs in most supported input formats of interest to mixed-precision algorithm developers: 8-, 16-, and 19-bit floating point.
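
To make the idea of a distinguishing test vector concrete, the following is a minimal Python sketch, not the paper's MATLAB models. It assumes a toy inner product in which exact products are aligned to the largest exponent and bits beyond `frac_bits` are either truncated or rounded to nearest even. The function names (`floor_log2`, `inner_product`), the two candidate behaviours, and the 3-bit width are all hypothetical choices made for illustration.

```python
from fractions import Fraction
import math


def floor_log2(x):
    """Exact floor(log2(x)) for a positive Fraction."""
    e = x.numerator.bit_length() - x.denominator.bit_length()
    # bit_length arithmetic is off by at most one; correct it exactly.
    return e if Fraction(2) ** e <= x else e - 1


def inner_product(a, b, frac_bits, mode):
    """Toy model (an assumption, not a hardware-validated model).

    Products are formed exactly, aligned to the exponent of the largest
    product, reduced to `frac_bits` fractional bits using `mode`
    ("truncate" or "rne"), and then summed exactly.
    """
    products = [Fraction(x) * Fraction(y) for x, y in zip(a, b)]
    nonzero = [p for p in products if p != 0]
    if not nonzero:
        return Fraction(0)
    max_exp = max(floor_log2(abs(p)) for p in nonzero)
    ulp = Fraction(2) ** (max_exp - frac_bits)  # weight of the last kept bit
    total = 0
    for p in nonzero:
        q = p / ulp  # exact number of ulps, possibly fractional
        if mode == "truncate":
            total += math.trunc(q)  # drop the bits below the last kept bit
        elif mode == "rne":
            total += round(q)  # Fraction rounds half to even
        else:
            raise ValueError("mode must be 'truncate' or 'rne'")
    return total * ulp


if __name__ == "__main__":
    ones = [1, 1]
    # 7/32 is 1.75 ulps when 3 fractional bits are kept: the two models differ.
    distinguishing = [1, Fraction(7, 32)]
    # 1/8 is exactly 1 ulp: both models agree, so this vector reveals nothing.
    uninformative = [1, Fraction(1, 8)]
    for vec in (distinguishing, uninformative):
        print(inner_product(vec, ones, 3, "truncate"),
              inner_product(vec, ones, 3, "rne"))
```

Under these assumptions the first vector separates the two candidate behaviours while the second does not, which is the sense in which a given test vector may or may not distinguish between hardware models.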

Paper Structure

This paper contains 23 sections, 5 equations, 7 figures, 2 tables, 1 algorithm.

Figures (7)

  • Figure 1: Number of machines on the November TOP500 lists that support low-precision floating-point formats, and low- and mixed-precision matrix multiplication operations. NVIDIA, AMD, and Intel GPUs are included in the counts.
  • Figure 2: A model of the inner product within the V100 GPU tensor core. Here, RNE denotes round-to-nearest-even rounding mode.
  • Figure 3: A model of the inner product within the A100 GPU tensor core for the three input formats. A100 also has an fp64 tensor core, but that tensor core is compliant with the IEEE 754 FMA operation and is not shown here.
  • Figure 4: A model of the inner product within the tensor cores of the L40S and Ada Lovelace RTX 1000 GPU for the fp8 input format. For fp16, bf16 and tf19, the model is identical to A100 and is not shown.
  • Figure 5: A model of the inner product within the tensor cores of the H100/H200/B200 GPUs for (a) fp16/bf16, (b) tf19, (c) the fp8 input format provided via mma.sync (which internally uses the fp16 tensor core with an interleaved input pattern) and (d) the fp8 tensor core accessed via wgmma.mma_async (specific to the Hopper architecture). The fp64 tensor core is compliant with the IEEE 754 FMA operation and is not shown.
  • ...and 2 more figures