Cascading GEMM: High Precision from Low Precision
Devangi N. Parikh, Robert A. van de Geijn, Greg M. Henry
TL;DR
The paper shows how to approximate double-double (FP64x2) GEMM by cascading multiple FP64 GEMMs, giving higher-precision results on hardware without native FP64x2 support. It introduces a ten-GEMM cascade built on the BLIS framework, supported by a forward-error analysis, a prototype implementation, and cancellation-detection mechanisms; preliminary experiments show promising accuracy and performance. The approach offers a portable path to higher-precision GEMM and suggests extensions to other precisions, to GPUs, and to mixed-precision BLAS routines, along with open questions in scaling, performance modeling, and hardware-aware optimization.
Abstract
This paper lays out insights and opportunities for implementing higher-precision matrix-matrix multiplication (GEMM) in terms of lower-precision high-performance GEMM. The driving case study approximates double-double precision (FP64x2) GEMM in terms of double precision (FP64) GEMM, leveraging how the BLAS-like Library Instantiation Software (BLIS) framework refactors the Goto Algorithm. With this, it is shown how approximate FP64x2 GEMM accuracy can be achieved with ten "cascading" FP64 GEMMs. Promising results from preliminary performance and accuracy experiments are reported. The demonstrated techniques open up new research directions for more general cascading of higher-precision computation in terms of lower-precision computation for GEMM-like functionality.
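To make the idea of cascading concrete, the following is a minimal sketch of one way a count of ten FP64 GEMMs can arise; it is not the BLIS-based implementation. It assumes (none of this is stated in the abstract) that each high-precision matrix is split into four FP64 chunks of 21 bits each, that rows of A and columns of B share a common scaling so chunk products accumulate without rounding, and that only products A_i B_j with i + j <= 3 are kept. Exact `Fraction` values stand in for FP64x2 inputs, and all names (`split_rows`, `cascade_gemm`, `BITS`, `NUM_CHUNKS`, `PAIRS`) are hypothetical.

```python
from fractions import Fraction
import math

BITS = 21        # assumed bits per chunk; 2*BITS + log2(k) must stay below 53
NUM_CHUNKS = 4   # assumed number of FP64 chunks per high-precision matrix
# Keep only products A_i @ B_j with i + j <= 3: exactly ten FP64 GEMMs.
PAIRS = [(i, j) for i in range(NUM_CHUNKS) for j in range(NUM_CHUNKS)
         if i + j < NUM_CHUNKS]

def split_rows(M):
    """Split each row of M (Fractions) into FP64 chunks on a shared per-row grid."""
    chunks = [[[0.0] * len(row) for row in M] for _ in range(NUM_CHUNKS)]
    for r, row in enumerate(M):
        e = max(math.frexp(float(x))[1] for x in row)    # every |x| < 2**e
        for c, x in enumerate(row):
            resid = x
            for d in range(NUM_CHUNKS):
                shift = (d + 1) * BITS - e
                q = round(resid * Fraction(2) ** shift)  # small integer
                chunk = math.ldexp(float(q), -shift)     # exact in FP64
                chunks[d][r][c] = chunk
                resid -= Fraction(chunk)
    return chunks

def transpose(M):
    return [list(col) for col in zip(*M)]

def gemm(A, B):
    """Naive triple-loop GEMM, standing in for a high-performance FP64 kernel."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def two_sum(a, b):
    """Error-free addition: returns (s, err) with s + err == a + b exactly."""
    s = a + b
    t = s - a
    return s, (a - (s - t)) + (b - t)

def cascade_gemm(A, B):
    """Approximate high-precision A @ B as a (hi, lo) pair using ten FP64 GEMMs."""
    A_chunks = split_rows(A)                                         # per-row scaling
    B_chunks = [transpose(Bt) for Bt in split_rows(transpose(B))]    # per-column scaling
    m, n = len(A), len(B[0])
    hi = [[0.0] * n for _ in range(m)]
    lo = [[0.0] * n for _ in range(m)]
    for i, j in PAIRS:
        P = gemm(A_chunks[i], B_chunks[j])                           # one FP64 GEMM
        for r in range(m):
            for c in range(n):
                hi[r][c], err = two_sum(hi[r][c], P[r][c])
                lo[r][c] += err                                      # keep the rounding error
    return hi, lo

# Inputs that are not exactly representable in FP64.
A = [[Fraction(1, 3), Fraction(1, 7)], [Fraction(1, 11), Fraction(1, 13)]]
B = [[Fraction(1, 17), Fraction(1, 19)], [Fraction(1, 23), Fraction(1, 29)]]

exact = gemm(A, B)                                                   # exact rational reference
plain = gemm([[float(x) for x in row] for row in A],
             [[float(x) for x in row] for row in B])                 # ordinary FP64 GEMM
hi, lo = cascade_gemm(A, B)

idx = [(r, c) for r in range(2) for c in range(2)]
plain_err = max(abs(Fraction(plain[r][c]) - exact[r][c]) for r, c in idx)
cascade_err = max(abs(Fraction(hi[r][c]) + Fraction(lo[r][c]) - exact[r][c])
                  for r, c in idx)
print(len(PAIRS), float(plain_err), float(cascade_err))
```

In this sketch the shared per-row and per-column grids are what allow each chunk product to be accumulated in FP64 without rounding for small inner dimensions; the dropped products (i + j > 3) and the truncation to four chunks are why the result only approximates full FP64x2 accuracy.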
