FTTN: Feature-Targeted Testing for Numerical Properties of NVIDIA & AMD Matrix Accelerators

Xinyi Li, Ang Li, Bo Fang, Katarzyna Swirydowicz, Ignacio Laguna, Ganesh Gopalakrishnan

TL;DR

This paper contributes a collection of Feature-Targeted Tests for Numerical Properties (FTTN) that help determine the undocumented numerical features of matrix accelerators across five floating-point formats and four rounding modes, plus additional tests that highlight rounding behaviors and the preservation of extra precision bits.

Abstract

NVIDIA Tensor Cores and AMD Matrix Cores (together called Matrix Accelerators) are of growing interest in high-performance computing and machine learning owing to their high performance. Unfortunately, their numerical behaviors are not publicly documented, including the number of extra precision bits maintained, the accumulation order of addition, and whether subnormal numbers are handled predictably during computations. This makes it impossible to reliably port codes across these differing accelerators. This paper contributes a collection of Feature-Targeted Tests for Numerical Properties (FTTN) that help determine these features across five floating-point formats and four rounding modes, plus additional tests that highlight rounding behaviors and the preservation of extra precision bits. To show the practical relevance of FTTN, we design a simple matrix-multiplication test using insights gathered from our feature tests. Executing this test on five platforms produced different answers: V100, A100, and MI250X produced 0, MI100 produced 255.875, and Hopper H100 produced 191.875. Our matrix-multiplication tests employ patterns found in iterative refinement-based algorithms, highlighting the need to check for significant result variability when porting code across GPUs.
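
The sensitivity to accumulation order mentioned in the abstract can be illustrated on an ordinary CPU. The sketch below is not one of the FTTN tests and does not model any accelerator. It only assumes IEEE 754 binary64 arithmetic with round-to-nearest, ties-to-even (the behavior of Python floats on common platforms), and the helper name `accumulate` is hypothetical.

```python
def accumulate(terms):
    """Add the terms strictly left to right, rounding after every addition."""
    total = 0.0
    for t in terms:
        total += t
    return total


BIG = 2.0 ** 53  # above this magnitude, binary64 cannot represent every integer

# Same four values, two accumulation orders.
in_order = accumulate([BIG, 1.0, 1.0, -BIG])    # each 1.0 is lost to rounding
reordered = accumulate([BIG, -BIG, 1.0, 1.0])   # the large terms cancel first

print(in_order)   # 0.0
print(reordered)  # 2.0
```

Both orders sum the same inputs, yet one returns 0 and the other 2. Hardware that fixes a different internal accumulation order, or keeps a different number of extra bits, can diverge in the same way.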

Paper Structure (24 sections, 5 equations, 5 figures, 3 tables, 1 algorithm)


Figures (5)

  • Figure 1: Matrix Unit Testing Approach. For the matrix accelerator under test, all the properties shown on the right are checked (all but monotonicity) or implied (monotonicity) by our tests.
  • Figure 2: The logic of test T_rnd_dir. The $a_{11}b_{11}$ product and the $a_{12}b_{21}$ product (alternatively the $c_{11}$ value) are set to the indicated values, all other inputs are set to $0$, and the computation is executed. Examining the $d_{11}$ output then shows which rounding case applies. Similar logic underlies the $T_{prod}$ test.
  • Figure 3: Testing workflow that sharpens each later test based on the previous ones. First settle the rounding mode of the accumulator (T_rnd_dir). Then settle the presence of an extra bit; if one is present, determine the initial rounding mode; then settle the use of 3 extra bits (T_1_bit, T_rnd_dir, and T_3_bits_fin_rnd); if they are used, check for ties and the sticky bit. Having concluded the rounding mode, switch to settling the FMA properties, and then the extra bits preserved. At that point, we can determine the block FMA width and accumulation order control (T_blk_fma_width, T_acc_order), and settle whether normalization happens once.
  • Figure 4: Binary Computation of Two-Number Addition in Round-to-Nearest Mode. To read this figure: on the left, $a_{11}b_{11}$ (the augend) is shown for a specific input. This value is aligned because the addend ($a_{12}b_{21}$ or $c_{11}$) has the higher exponent. The alignment under one, two, or three extra bits is shown underneath $a_{11}b_{11}$. The result produced by "rtn" (RTN-TE) is shown as emitted by the bottom red oval.
  • Figure 5: Binary Computation of Two-Number Addition in Round-to-Zero Mode. Read this figure the same way as Figure 4.
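
The idea behind the rounding-direction logic in Figure 2, choosing inputs and reading the rounding mode off the output, can be sketched in software. The following is a simplified model, not the paper's T_rnd_dir test. It assumes the adder computes the exact sum and rounds it once to a `p`-bit significand (real accelerators may keep only a few extra bits, which is what the later tests in Figure 3 settle). It also assumes the four modes are round-to-nearest ties-to-even, toward zero, toward +infinity, and toward -infinity. The names `round_to_precision` and `classify_rounding` are hypothetical.

```python
from fractions import Fraction
import math


def round_to_precision(x, p, mode):
    """Round the exact value x to p significant bits under the given mode."""
    if x == 0:
        return Fraction(0)
    a = abs(x)
    # Find e with 2**e <= |x| < 2**(e + 1) using integer bit lengths only.
    e = a.numerator.bit_length() - a.denominator.bit_length()
    if Fraction(2) ** e > a:
        e -= 1
    ulp = Fraction(2) ** (e - p + 1)   # spacing of p-bit values near x
    q = x / ulp                        # x measured in units of ulp
    lo = math.floor(q)
    hi = lo if q == lo else lo + 1
    if mode == "rd":                   # toward -infinity
        n = lo
    elif mode == "ru":                 # toward +infinity
        n = hi
    elif mode == "rz":                 # toward zero
        n = lo if q >= 0 else hi
    elif mode == "rtn":                # to nearest, ties to even
        frac = q - lo
        half = Fraction(1, 2)
        if frac < half:
            n = lo
        elif frac > half:
            n = hi
        else:
            n = lo if lo % 2 == 0 else hi
    else:
        raise ValueError(f"unknown mode: {mode}")
    return n * ulp


def classify_rounding(add, p):
    """Infer the rounding mode of a black-box adder from two probes."""
    ulp_of_one = Fraction(2) ** (1 - p)
    small = Fraction(3, 4) * ulp_of_one        # above half an ulp, below one ulp
    pos_moved = add(Fraction(1), small) > 1    # did 1 + small round away from 1?
    neg_moved = add(Fraction(-1), -small) < -1 # did -1 - small round away from -1?
    table = {
        (False, False): "rz",
        (True, False): "ru",
        (False, True): "rd",
        (True, True): "rtn",
    }
    return table[(pos_moved, neg_moved)]


# Usage: wrap each software mode as a black box and recover it from outputs alone.
P = 11  # significand bits of IEEE binary16, chosen only for illustration
for mode in ("rtn", "rz", "ru", "rd"):
    adder = lambda a, b, m=mode: round_to_precision(a + b, P, m)
    print(mode, "->", classify_rounding(adder, P))
```

Two probes suffice here because each mode treats the positive and negative cases differently. On hardware, the same reasoning would have to account for how many bits survive alignment before the final rounding.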