Analysis of Floating-Point Matrix Multiplication Computed via Integer Arithmetic

Ahmad Abdelfattah, Jack Dongarra, Massimiliano Fasi, Mantas Mikaitis, Françoise Tisseur

Abstract

Ootomo, Ozaki, and Yokota [Int. J. High Perform. Comput. Appl., 38 (2024), p. 297-313] have proposed a strategy to recast a floating-point matrix multiplication in terms of integer matrix products. The factors A and B are split into integer slices, the product of these slices is computed exactly, and AB is approximated by accumulating these integer products in floating-point arithmetic. This technique is particularly well suited to mixed-precision matrix multiply-accumulate units with integer support, such as the NVIDIA tensor cores or the AMD matrix cores. The number of slices allows for performance-accuracy tradeoffs: more slices yield better accuracy but require more multiplications, which in turn reduce performance. We propose an inexpensive way to estimate the minimum number of multiplications needed to achieve a prescribed level of accuracy. Our error analysis shows that the algorithm may become inaccurate (or inefficient) if rows of A or columns of B are badly scaled. We perform a range of numerical experiments, both in simulation and on the latest NVIDIA GPUs, that confirm the analysis and illustrate strengths and weaknesses of the algorithm.
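The slicing idea described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `split_rows` and `sliced_matmul`, the default slice width of 6 bits, the use of truncation to extract slices, and the rule of keeping only slice pairs with k + l <= num_slices + 1 are all assumptions made for illustration, and 64-bit NumPy integers stand in for the narrow integer formats that tensor cores or matrix cores would use.

```python
import numpy as np

def split_rows(A, num_slices, bits):
    """Split A into integer slices after scaling each row by a power of two.

    Hypothetical helper: returns (slices, row_exp) such that
    A ~= 2**row_exp[:, None] * sum_k 2**(-bits * k) * slices[k - 1].
    """
    _, exps = np.frexp(A)                  # A = m * 2**exps with 0.5 <= |m| < 1
    row_exp = exps.max(axis=1)             # one common exponent per row
    R = np.ldexp(A, -row_exp[:, None])     # scaled entries, all of magnitude < 1
    slices = []
    for _ in range(num_slices):
        R = np.ldexp(R, bits)              # move the next `bits` bits above the point
        S = np.trunc(R)                    # integer slice with |S| < 2**bits
        slices.append(S.astype(np.int64))
        R = R - S                          # remainder, computed exactly
    return slices, row_exp

def sliced_matmul(A, B, num_slices, bits=6):
    """Approximate A @ B from exact integer products of slices."""
    A_sl, ea = split_rows(A, num_slices, bits)
    Bt_sl, eb = split_rows(B.T, num_slices, bits)   # columns of B = rows of B.T
    C = np.zeros((A.shape[0], B.shape[1]))
    for k in range(1, num_slices + 1):
        # Assumption: skip the least significant pairs (k + l > num_slices + 1).
        for l in range(1, num_slices + 2 - k):
            P = A_sl[k - 1] @ Bt_sl[l - 1].T        # exact integer matrix product
            # Accumulate the scaled integer product in floating-point arithmetic.
            C += np.ldexp(P.astype(np.float64), -bits * (k + l))
    # Undo the row scaling of A and the column scaling of B.
    return np.ldexp(C, ea[:, None] + eb[None, :])

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 5))
B = rng.standard_normal((5, 3))
for s in (2, 4, 9):
    err = np.max(np.abs(sliced_matmul(A, B, s) - A @ B))
    print(f"slices={s}: max abs error = {err:.3e}")
```

Increasing `num_slices` should shrink the error at the cost of more integer products, which is the performance-accuracy tradeoff the abstract refers to. Because each row of A and each column of B shares a single exponent, entries much smaller than the largest one in their row or column keep fewer significant bits, which gives some intuition for why bad scaling can hurt accuracy.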

Paper Structure

This paper contains 20 sections, 57 equations, 12 figures.

Figures (12)

  • Figure 1: Bit splitting to obtain the slices for the two matrices in \ref{eq:ex-mats}. On the left, the matrix entries are represented using a radix-2 scientific notation with $p=8$. The second step uses a block fixed-point representation with a common scale and 12 significant bits. The bits that are prepended or appended, compared with the previous step, are greyed out, and the leading bit (stricken out) is always zero. The final step contains the slicing of each matrix into matrices with elements in $\mathbb{I}_{3}$.
  • Figure 1: Error \ref{eq:fwd-err-def} for the vectors in \ref{eq:simple-dot-product} with $\varphi$ between 0 and 100.
  • Figure 2: Products computed by different variants of the integer Ozaki scheme. The constant in each box is the scaling factor to be applied to the product of the slice of $A$ in the corresponding row and the slice of $B$ in the corresponding column. The algorithm of Ootomo, Ozaki, and Yokota [ooy24] only computes the products corresponding to boxes with a solid edge, and it accumulates them in floating-point arithmetic. Uchino, Ozaki, and Imamura [uoi25] use integer arithmetic to accumulate the matrices with the same scale factor (along the black diagonals), followed by accumulation of the partial sums in floating-point arithmetic.
  • Figure 2: Error \ref{eq:fwd-err-def-mat} obtained by replicating the setup in [ooy24].
  • Figure 3: Alignment of bits in the 16 products of the form $A_{(\ell)}B^{(h)}$ for the slices in \ref{fig:ex-splitting}. The dashed lines separate blocks of partial products with the same scale factor, which lie along the same diagonal in \ref{fig:products}. The products below the thin, solid line correspond to the greyed-out boxes with a dashed border in \ref{fig:products}. The value below the thick solid line is the full-precision fixed-point representation of the result including all products. In this case, this is the exact result, because all the bits in $A$ and $B$ were allocated to a slice, and all slices were used in the computation.
  • ...and 7 more figures