Fast and Practical Strassen's Matrix Multiplication using FPGAs
Afzal Ahmad, Linfeng Du, Wei Zhang
TL;DR
This work targets efficient general matrix multiply (GeMM) on FPGAs by presenting a two-level Strassen's squared algorithm that is plug-compatible with a high-performance GeMM kernel. Using on-chip 4×4 block buffers to reuse submatrices and a dedicated LHS/RHS workflow, the design computes 49 intermediate results ($m_0$ to $m_{48}$) via 49 GeMM calls while streaming and accumulating directly into output buffers. Empirical results on Alveo U50/U280 show that, for low-precision data, the Strassen-based kernel can outperform optimized GeMM implementations (e.g., up to $1.85\times$ speedup for int8 and $158.8$ GOPS for int16) with comparable resource use, and that the memory interface (HBM vs. DDR) significantly influences performance. The findings demonstrate practical Strassen-enabled FPGA accelerators for small-to-medium matrix sizes and low-precision arithmetic, highlighting meaningful gains in throughput and energy efficiency in DNN-like workloads.
Abstract
Matrix multiplication is a cornerstone operation in a wide array of scientific fields, including machine learning and computer graphics. The standard algorithm for matrix multiplication has a complexity of $\mathcal{O}(n^3)$ for $n\times n$ matrices. Strassen's algorithm improves this to $\mathcal{O}(n^{2.807})$, but its practicality is limited for small to medium matrix sizes due to the large number of additions it introduces. This paper presents a novel FPGA-based implementation of Strassen's algorithm that achieves superior speed over an optimized General Matrix Multiply (GeMM) implementation for matrices as small as $n=256$. Our design, tested extensively on two high-performance FPGA accelerators (Alveo U50 and U280) across various data types, matches or surpasses the performance of a highly optimized baseline across a range of matrix sizes.
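To make the operation count concrete, the following is a minimal software sketch of Strassen's recursion applied for two levels. It is not the FPGA design described in this paper: it uses the textbook seven-product formulas recursively rather than a flattened 49-product schedule, and the helper names (`matmul`, `split`, `join`, `strassen`) and the call counter are illustrative assumptions. It shows only that two levels of recursion replace the $8^2 = 64$ block products of the standard algorithm with $7^2 = 49$ base-case multiplications, at the cost of extra additions and subtractions.

```python
def matmul(a, b):
    """Standard O(n^3) product of two square matrices (lists of rows)."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def sub(a, b):
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def split(m):
    """Split a square matrix of even size into four equal quadrants."""
    h = len(m) // 2
    return ([r[:h] for r in m[:h]], [r[h:] for r in m[:h]],
            [r[:h] for r in m[h:]], [r[h:] for r in m[h:]])


def join(c11, c12, c21, c22):
    """Reassemble four quadrants into one matrix."""
    top = [left + right for left, right in zip(c11, c12)]
    bottom = [left + right for left, right in zip(c21, c22)]
    return top + bottom


def strassen(a, b, levels, calls):
    """Strassen recursion for `levels` levels, then fall back to matmul.

    `calls` is a one-element list used to count base-case multiplications.
    The matrix size must be divisible by 2**levels.
    """
    if levels == 0:
        calls[0] += 1
        return matmul(a, b)
    a11, a12, a21, a22 = split(a)
    b11, b12, b21, b22 = split(b)
    # Seven products instead of the eight needed by the block algorithm.
    m1 = strassen(add(a11, a22), add(b11, b22), levels - 1, calls)
    m2 = strassen(add(a21, a22), b11, levels - 1, calls)
    m3 = strassen(a11, sub(b12, b22), levels - 1, calls)
    m4 = strassen(a22, sub(b21, b11), levels - 1, calls)
    m5 = strassen(add(a11, a12), b22, levels - 1, calls)
    m6 = strassen(sub(a21, a11), add(b11, b12), levels - 1, calls)
    m7 = strassen(sub(a12, a22), add(b21, b22), levels - 1, calls)
    c11 = add(sub(add(m1, m4), m5), m7)
    c12 = add(m3, m5)
    c21 = add(m2, m4)
    c22 = add(add(sub(m1, m2), m3), m6)
    return join(c11, c12, c21, c22)


if __name__ == "__main__":
    n = 8
    a = [[(i * n + j) % 7 - 3 for j in range(n)] for i in range(n)]
    b = [[(i + 2 * j) % 5 - 2 for j in range(n)] for i in range(n)]
    calls = [0]
    c = strassen(a, b, 2, calls)
    print(c == matmul(a, b), calls[0])  # prints: True 49
```

In this sketch every level creates temporary sum and difference matrices, which is the addition overhead the abstract refers to. A hardware design would presumably schedule and buffer these operands rather than allocate them, but that detail is beyond what this example models.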
