Karatsuba Matrix Multiplication and its Efficient Custom Hardware Implementations
Trevor E. Pogue, Nicola Nicolici
TL;DR
This work extends the Karatsuba algorithm from scalar to matrix multiplication (KMM), preserving its reduction in multiplications while mitigating the addition overhead that typically limits Karatsuba's benefits at small bitwidths. It introduces a family of hardware architectures (baseline, fixed-precision, and precision-scalable) that map the KMM computation onto systolic arrays and matrix multiplication units (MXUs) suitable for GEMM in deep learning accelerators. Through a complexity analysis and an end-to-end evaluation on FPGA-based accelerators, the authors demonstrate improvements in throughput and area efficiency over conventional matrix multiplication and scalar Karatsuba implementations, including favorable performance-per-area in precision-scalable configurations and competitive results in fixed-precision designs. The results suggest that KMM and its hardware realizations can benefit integer-dominated DL workloads, including large GEMMs in CNNs and attention mechanisms, by delivering higher efficiency while still building on proven systolic arrays and conventional multipliers.
Abstract
While the Karatsuba algorithm reduces the complexity of large integer multiplication, the extra additions required minimize its benefits for smaller integers of more commonly-used bitwidths. In this work, we propose the extension of the scalar Karatsuba multiplication algorithm to matrix multiplication, showing how this maintains the reduction in multiplication complexity of the original Karatsuba algorithm while reducing the complexity of the extra additions. Furthermore, we propose new matrix multiplication hardware architectures for efficiently exploiting this extension of the Karatsuba algorithm in custom hardware. We show that the proposed algorithm and hardware architectures can provide real area or execution time improvements for integer matrix multiplication compared to scalar Karatsuba or conventional matrix multiplication algorithms, while also supporting implementation through proven systolic array and conventional multiplier architectures at the core. We provide a complexity analysis of the algorithm and architectures and evaluate the proposed designs both in isolation and in an end-to-end deep learning accelerator system compared to baseline designs and prior state-of-the-art works implemented on the same type of compute platform, demonstrating their ability to increase the performance-per-area of matrix multiplication hardware.
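To make the core idea concrete, the following is a minimal Python sketch of one level of Karatsuba applied to whole matrices. It is an illustration under stated assumptions, not the authors' implementation: the function names (`matmul`, `split`, `add`, `kmm`) are hypothetical, elements are assumed to be unsigned `w`-bit integers with `w` even, and plain lists of lists stand in for hardware data paths. Each element is split into high and low halves, and the product is assembled from three half-width matrix products instead of four.

```python
def matmul(a, b):
    """Conventional integer matrix product on lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def split(m, half):
    """Split every element into (high, low) parts, low part `half` bits wide."""
    mask = (1 << half) - 1
    hi = [[v >> half for v in row] for row in m]
    lo = [[v & mask for v in row] for row in m]
    return hi, lo


def add(a, b):
    """Element-wise matrix addition."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def kmm(a, b, w):
    """One level of Karatsuba matrix multiplication (illustrative sketch).

    Assumes unsigned w-bit elements and even w. Uses three matrix
    products on roughly half-width operands instead of four.
    """
    half = w // 2
    a1, a0 = split(a, half)  # A = A1 * 2^half + A0
    b1, b0 = split(b, half)  # B = B1 * 2^half + B0
    c1 = matmul(a1, b1)                    # high product
    c0 = matmul(a0, b0)                    # low product
    cs = matmul(add(a1, a0), add(b1, b0))  # product of the half sums
    # Cross term A1*B0 + A0*B1 equals cs - c1 - c0, element by element.
    return [
        [(h << w) + ((s - h - l) << half) + l
         for h, s, l in zip(rh, rs, rl)]
        for rh, rs, rl in zip(c1, cs, c0)
    ]


a = [[200, 17], [3, 255]]
b = [[91, 128], [254, 6]]
assert kmm(a, b, 8) == matmul(a, b)
```

In this sketch the sums `A1 + A0` and `B1 + B0` are formed once per matrix element, whereas applying scalar Karatsuba inside a conventional matrix product would repeat them for every scalar multiplication. That is the intuition behind the reduced addition overhead described in the abstract.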
