Cascading GEMM: High Precision from Low Precision

Devangi N. Parikh, Robert A. van de Geijn, Greg M. Henry

TL;DR

The paper tackles computing FP64x2 (double-double) GEMMs by cascading multiple FP64 GEMMs, enabling higher-precision results on hardware that lacks native FP64x2 support. It introduces a ten-GEMM cascade within BLIS, backed by forward-error analysis, a practical prototype, and cancellation-detection mechanisms, and reports preliminary experiments with a promising balance between accuracy and performance. The approach offers a portable path to higher-precision GEMM built on lower-precision kernels, and it motivates cascading in other precisions and on GPUs, with potential extensions to mixed-precision BLAS routines. Open directions include scaling, performance modeling, and hardware-aware enhancements.

Abstract

This paper lays out insights and opportunities for implementing higher-precision matrix-matrix multiplication (GEMM) in terms of lower-precision high-performance GEMM. The driving case study approximates double-double precision (FP64x2) GEMM in terms of double precision (FP64) GEMM, leveraging how the BLAS-like Library Instantiation Software (BLIS) framework refactors the Goto Algorithm. With this, it is shown how approximate FP64x2 GEMM accuracy can be cast in terms of ten "cascading" FP64 GEMMs. Promising results from preliminary performance and accuracy experiments are reported. The demonstrated techniques open up new research directions for more general cascading of higher-precision computation in terms of lower-precision computation for GEMM-like functionality.
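To make the idea of cascading concrete, the following is a minimal Python sketch, not the paper's implementation. It assumes a simplified splitting scheme: each high-precision entry (modeled here as an exact `Fraction` with magnitude below 1) is cut into four FP64 pieces of a fixed, hypothetical width of 20 bits, and only the piece products with index sum at most 3 are kept, which gives ten FP64 GEMMs. The names `split`, `split_matrix`, `gemm_fp64`, and `cascaded_gemm` are illustrative, and exact `Fraction` accumulation stands in for FP64x2 accumulation.

```python
from fractions import Fraction
import math

BITS = 20    # bits kept per piece (hypothetical choice for this sketch)
PIECES = 4   # number of FP64 pieces per high-precision entry

def split(x):
    """Split a Fraction with |x| < 1 into PIECES floats whose sum approximates x."""
    pieces, rest = [], Fraction(x)
    for k in range(PIECES):
        scale = 2 ** (BITS * (k + 1))
        q = math.trunc(rest * scale)     # integer with |q| < 2**BITS
        piece = Fraction(q, scale)
        pieces.append(float(piece))      # exact: q needs at most BITS bits
        rest -= piece
    return pieces

def split_matrix(M):
    """Return PIECES float matrices; matrix i holds piece i of every entry of M."""
    cells = [[split(x) for x in row] for row in M]
    return [[[cell[i] for cell in row] for row in cells] for i in range(PIECES)]

def gemm_fp64(A, B):
    """Plain float64 product of lists of lists (stand-in for a fast FP64 GEMM)."""
    return [[sum(A[r][p] * B[p][c] for p in range(len(B)))
             for c in range(len(B[0]))] for r in range(len(A))]

def cascaded_gemm(A, B):
    """Approximate A @ B for Fraction matrices using ten FP64 GEMMs."""
    A_parts, B_parts = split_matrix(A), split_matrix(B)
    C = [[Fraction(0)] * len(B[0]) for _ in A]
    count = 0
    for i in range(PIECES):
        for j in range(PIECES - i):      # keep only pairs with i + j <= 3
            P = gemm_fp64(A_parts[i], B_parts[j])
            count += 1
            for r, row in enumerate(P):
                for c, v in enumerate(row):
                    C[r][c] += Fraction(v)   # exact sum stands in for FP64x2
    return C, count

A = [[Fraction(1, 3), Fraction(-2, 7)], [Fraction(5, 11), Fraction(1, 13)]]
B = [[Fraction(3, 17), Fraction(1, 19)], [Fraction(-4, 23), Fraction(2, 29)]]
C, n_gemms = cascaded_gemm(A, B)         # n_gemms == 10
exact = [[sum(A[r][p] * B[p][c] for p in range(2)) for c in range(2)]
         for r in range(2)]
max_err = max(abs(C[r][c] - exact[r][c]) for r in range(2) for c in range(2))
print(n_gemms, float(max_err))
```

With 20-bit pieces, every scalar product inside `gemm_fp64` fits in 40 bits, so the small FP64 GEMMs in this sketch incur no rounding; the remaining error comes from truncating each entry to four pieces and from dropping the piece products with index sum of 4 or more.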

Paper Structure

This paper contains 39 sections, 3 theorems, 55 equations, 10 figures, and 1 table.

Key Result

theorem 1

Let $\epsilon_0, \cdots , \epsilon_{n-1}$ satisfy $\vert \epsilon_i \vert \leq \epsilon_{\rm mach}$. Then there exists a $\theta_n$ such that $( 1 + \theta_n ) = ( 1 + \epsilon_0 ) \cdots ( 1 + \epsilon_{n-1} )$, where $\vert \theta_n \vert \leq \gamma_n \epsilon_{\rm mach}$ with $\gamma_n = n \epsi
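As a sanity check on this kind of accumulated-error bound, the sketch below verifies the classical form of the result from the rounding-error literature, $\vert \theta_n \vert \leq n u / (1 - n u)$ for $n u < 1$, using exact rational arithmetic. The normalization of $\gamma_n$ used in the paper may differ from this classical form, so the function name `classical_bound` and the choices $u = 2^{-53}$ and $n = 8$ are assumptions made only for illustration.

```python
from fractions import Fraction
import itertools

def theta(eps_list):
    """Return theta such that (1 + theta) equals the product of (1 + eps_i)."""
    prod = Fraction(1)
    for e in eps_list:
        prod *= 1 + e
    return prod - 1

def classical_bound(n, u):
    """Classical bound n*u / (1 - n*u), valid when n*u < 1."""
    return n * u / (1 - n * u)

u = Fraction(1, 2 ** 53)   # assumed unit roundoff for this illustration
n = 8

# Try every sign pattern with |eps_i| = u (the extreme cases) and keep the largest |theta|.
worst = max(abs(theta(signs)) for signs in itertools.product([u, -u], repeat=n))
print(worst <= classical_bound(n, u))
```

The largest value occurs when all perturbations share the positive sign, which is why the bound slightly exceeds the first-order estimate $n u$.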

Figures (10)

  • Figure 1: Illustration of how FP64x2 number $\widehat{\chi}$ is cascaded into four FP64 splits. Here we assume that we start with a normalized number so that $\beta_0 = 1$. If $D_i \leq D$ for $i = 0, 1, 2$, then this also illustrates that additional precision can be stored in the cascaded number, which can improve accuracy when the cascaded representation is used for intermediary accumulation.
  • Figure 2: The BLIS refactoring of the GotoBLAS algorithm for GEMM as five loops around the microkernel. This diagram, which is often used when explaining the fundamental techniques that underlie the BLIS implementation of GEMM, was modified from a similar image first published in BLIS5 and is used with permission.
  • Figure 3: GEMM performed as a series of rank-k updates.
  • Figure 4: Left: Packing layout for a panel of $B$. Right: Packing layout for a block of $A$. The colors correspond to the legend in Figure 2.
  • Figure 5: Top: Operation that Goto's Algorithm performs in the first loop around the micro-kernel. Bottom: Operation performed with cascaded matrices by the first loop around the micro-kernel. The colors correspond to the legend in Figure 2.
  • ...and 5 more figures
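
The view of GEMM as a series of rank-k updates (Figure 3) can be sketched in a few lines of Python. This is only an illustration under simple assumptions: matrices are lists of lists, the function name `gemm_rank_k` and the block size parameter `kc` are hypothetical, and no packing or microkernel is modeled.

```python
def gemm_rank_k(A, B, kc):
    """Compute C = A @ B by accumulating rank-kc updates along the inner dimension."""
    m, k, n = len(A), len(B), len(B[0])
    C = [[0] * n for _ in range(m)]
    for p0 in range(0, k, kc):           # one rank-k update per slice of the inner dimension
        p1 = min(p0 + kc, k)             # the last slice may be narrower than kc
        for i in range(m):
            for j in range(n):
                # C += A[:, p0:p1] @ B[p0:p1, :], written out entry by entry
                C[i][j] += sum(A[i][p] * B[p][j] for p in range(p0, p1))
    return C

A = [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]]
B = [[1, 0, 2], [0, 1, 3], [4, 5, 6], [7, 8, 9], [1, 1, 1]]
C = gemm_rank_k(A, B, kc=2)
direct = [[sum(A[i][p] * B[p][j] for p in range(5)) for j in range(3)]
          for i in range(2)]
print(C == direct)
```

Each pass through the outer loop touches only a column panel of $A$ and a row panel of $B$, which is the unit of work that blocked implementations pack and reuse.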

Theorems & Definitions (5)

  • definition 1: Standard Computational Model
  • theorem 1
  • theorem 2
  • corollary 1
  • proof