MMA-Sim: Bit-Accurate Reference Model of Tensor Cores and Matrix Cores

Peichen Xie, Yang Wang, Fan Yang, Mao Yang

TL;DR

MMA-Sim is the first bit-accurate reference model for matrix multiplication accelerators (MMAs), covering ten GPU architectures with nine arithmetic algorithms that reproduce the exact behavior of the hardware. It uses a testing-based workflow to infer summation order, accumulation precision, rounding modes, and special-value handling, and it is validated against hardware for bitwise equivalence on over a million test cases. The model exposes undocumented behaviors (e.g., FP8 precision shifts, asymmetric rounding on CDNA3) and offers practical guidance for software and hardware designers who want to improve numerical stability and reproducibility in DNN workloads. Because it is open-source, MMA-Sim enables cross-vendor numerical analysis, reproducibility studies, and hardware-aware optimization for precision-sensitive AI systems.

Abstract

The rapidly growing computation demands of deep neural networks (DNNs) have driven hardware vendors to integrate matrix multiplication accelerators (MMAs), such as NVIDIA Tensor Cores and AMD Matrix Cores, into modern GPUs. However, due to distinct and undocumented arithmetic specifications for floating-point matrix multiplication, some MMAs can lead to numerical imprecision and inconsistency that can compromise the stability and reproducibility of DNN training and inference. This paper presents MMA-Sim, the first bit-accurate reference model that reveals the detailed arithmetic behaviors of the MMAs from ten GPU architectures (eight from NVIDIA and two from AMD). By dissecting the MMAs using a combination of targeted and randomized tests, our methodology derives nine arithmetic algorithms to simulate the floating-point matrix multiplication of the MMAs. Large-scale validation confirms bitwise equivalence between MMA-Sim and the real hardware. Using MMA-Sim, we investigate arithmetic behaviors that affect DNN training stability, and identify undocumented behaviors that could lead to significant errors.
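To make the inconsistency concern concrete, here is a minimal sketch of how summation order alone can change a floating-point dot-product result. It is not one of MMA-Sim's nine algorithms: the helper names (`f32`, `sum_sequential`, `sum_pairwise`), the two orders, and the FP32 accumulator are assumptions chosen for illustration only.

```python
import struct


def f32(x: float) -> float:
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def sum_sequential(products):
    """Left-to-right accumulation, rounding to FP32 after every addition."""
    acc = 0.0
    for p in products:
        acc = f32(acc + p)
    return acc


def sum_pairwise(products):
    """Tree-style accumulation: add neighbours, then add the partial sums."""
    vals = list(products)
    if not vals:
        return 0.0
    while len(vals) > 1:
        nxt = [f32(vals[i] + vals[i + 1]) for i in range(0, len(vals) - 1, 2)]
        if len(vals) % 2:  # carry an unpaired last element to the next level
            nxt.append(vals[-1])
        vals = nxt
    return vals[0]


# Hypothetical products a_i * b_i, all exactly representable in FP32.
products = [2.0**24, 1.0, 1.0, -(2.0**24)]

print(sum_sequential(products))  # 0.0
print(sum_pairwise(products))    # 1.0
print(sum(products))             # 2.0 (exact here, since binary64 has enough precision)
```

The same four inputs yield three different answers depending on order and accumulator width, which is why a bit-accurate model has to pin down both.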

Paper Structure

This paper contains 29 sections, 7 equations, 2 figures, 4 tables, and 3 algorithms.

Figures (2)

  • Figure 1: Four typical summation orders on Tensor Cores and Matrix Cores. #$i$ represents the product $a_ib_i$.
  • Figure 2: Distributions of $\delta_{RD}$, the numerical deviation of the CDNA3 FP16 MFMA instruction, which uses the rounding-down (RD) mode, and $\delta_{RZ}$, the numerical deviation of a hypothetical FP16 MFMA instruction that uses the rounding-to-zero (RZ) mode.
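
The two rounding modes named in the Figure 2 caption can be compared with a small sketch. This is an illustration of RD versus RZ in general, not the paper's simulator: the function name `round_to_precision` is hypothetical, and the sketch assumes normal-range values only (no subnormals, no overflow), with 11 significand bits standing in for FP16.

```python
import math


def round_to_precision(x: float, bits: int, mode: str) -> float:
    """Round x to `bits` significand bits using RD or RZ.

    Assumption: x is in the normal range of the target format, so
    subnormals and overflow are not modelled.
    """
    if x == 0.0 or math.isinf(x) or math.isnan(x):
        return x
    _, exp = math.frexp(x)             # x = m * 2**exp with 0.5 <= |m| < 1
    ulp = math.ldexp(1.0, exp - bits)  # spacing of representable values near x
    scaled = x / ulp                   # exact: division by a power of two
    if mode == "RD":                   # round down, toward negative infinity
        q = math.floor(scaled)
    elif mode == "RZ":                 # round toward zero (truncate)
        q = math.trunc(scaled)
    else:
        raise ValueError(f"unknown rounding mode: {mode}")
    return q * ulp


FP16_BITS = 11                         # 10 stored bits plus the implicit leading 1
x = 1.0 + 2.0**-12                     # not representable in FP16

for value in (x, -x):
    rd = round_to_precision(value, FP16_BITS, "RD")
    rz = round_to_precision(value, FP16_BITS, "RZ")
    print(value, rd - value, rz - value)
```

For positive inputs the two modes agree. For negative inputs RD moves away from zero while RZ moves toward it, so the RD error is never positive, whereas the RZ error takes the opposite sign of the input.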