Fast and Practical Strassen's Matrix Multiplication using FPGAs

Afzal Ahmad; Linfeng Du; Wei Zhang

TL;DR

This work targets efficient general matrix multiply (GeMM) on FPGAs by presenting a two-level Strassen's algorithm (Strassen's squared, or Strassen$^2$) that is plug-compatible with a high-performance GeMM kernel. The design uses on-chip 4×4 block buffers to reuse submatrices and a dedicated workflow for forming the left-hand side (LHS) and right-hand side (RHS) operands. It computes 49 intermediate results ($m_0$ to $m_{48}$) via 49 GeMM calls while streaming and accumulating directly into output buffers. Empirical results on Alveo U50/U280 show that, for low-precision data, the Strassen-based kernel can outperform optimized GeMM implementations (e.g., up to $1.85\times$ speedup for int8 and $158.8$ GOPS for int16) with comparable resource use, and that the memory interface (HBM vs. DDR) significantly influences performance. The findings demonstrate practical Strassen-enabled FPGA accelerators for small-to-medium matrix sizes and low-precision arithmetic, with meaningful gains in throughput and energy efficiency for DNN-like workloads.
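The count of 49 GeMM calls follows from applying Strassen's seven-product scheme twice ($7 \times 7 = 49$). The following is a minimal software sketch of that idea, not the paper's FPGA kernel: `base_gemm`, `strassen`, `levels`, and `counter` are illustrative names, the product ordering is the textbook one and may differ from the paper's $m_0, ..., m_{48}$ indexing, and the matrix dimension is assumed to be divisible by $2^{\text{levels}}$.

```python
# Sketch only: recursive Strassen on nested lists, counting base multiplies.
# All names here are hypothetical and chosen for illustration.

def add(X, Y):
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def sub(X, Y):
    return [[x - y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def base_gemm(X, Y, counter):
    """Standard O(n^3) multiply; stands in for the optimized GeMM kernel."""
    counter[0] += 1
    cols = list(zip(*Y))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in X]

def split(X):
    """Split a square matrix into four equally sized quadrants."""
    h = len(X) // 2
    return ([r[:h] for r in X[:h]], [r[h:] for r in X[:h]],
            [r[:h] for r in X[h:]], [r[h:] for r in X[h:]])

def strassen(A, B, levels, counter):
    """Apply `levels` levels of Strassen, then fall back to base_gemm."""
    if levels == 0:
        return base_gemm(A, B, counter)
    a11, a12, a21, a22 = split(A)
    b11, b12, b21, b22 = split(B)

    def rec(X, Y):
        return strassen(X, Y, levels - 1, counter)

    # Seven products; each argument pair is one (LHS, RHS) operand pair.
    m0 = rec(add(a11, a22), add(b11, b22))
    m1 = rec(add(a21, a22), b11)
    m2 = rec(a11, sub(b12, b22))
    m3 = rec(a22, sub(b21, b11))
    m4 = rec(add(a11, a12), b22)
    m5 = rec(sub(a21, a11), add(b11, b12))
    m6 = rec(sub(a12, a22), add(b21, b22))

    # Accumulate the products into the four output quadrants.
    c11 = add(sub(add(m0, m3), m4), m6)
    c12 = add(m2, m4)
    c21 = add(m1, m3)
    c22 = add(add(sub(m0, m1), m2), m5)
    top = [left + right for left, right in zip(c11, c12)]
    bottom = [left + right for left, right in zip(c21, c22)]
    return top + bottom

if __name__ == "__main__":
    n = 8
    A = [[(3 * i + j) % 7 - 3 for j in range(n)] for i in range(n)]
    B = [[(i + 5 * j) % 5 - 2 for j in range(n)] for i in range(n)]
    calls = [0]
    C = strassen(A, B, levels=2, counter=calls)
    assert C == base_gemm(A, B, [0])
    print(calls[0])  # number of base multiplies at two levels
```

With `levels=2` each base multiply operates on an $(n/4)\times(n/4)$ block, which is the granularity at which a design like the one described would hand work to the existing GeMM kernel. Integer inputs keep the comparison with the standard multiply exact.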

Abstract

Matrix multiplication is a cornerstone operation in a wide array of scientific fields, including machine learning and computer graphics. The standard algorithm for matrix multiplication has a complexity of $\mathcal{O}(n^3)$ for $n\times n$ matrices. Strassen's algorithm improves this to $\mathcal{O}(n^{2.807})$, but its practicality is limited for small to medium matrix sizes due to the large number of additions it introduces. This paper presents a novel FPGA-based implementation of Strassen's algorithm that achieves superior speed over an optimized General Matrix Multiply (GeMM) implementation for matrices as small as $n=256$. Our design, tested extensively on two high-performance FPGA accelerators (Alveo U50 and U280) across various data types, matches or surpasses the performance of a highly optimized baseline across a range of matrix sizes.
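As a brief worked derivation of the exponent quoted above (standard textbook reasoning, not reproduced from the paper): one level of Strassen's algorithm replaces the 8 half-size block products of the standard algorithm with 7, at the cost of a constant number of half-size matrix additions. The running time $T(n)$ therefore satisfies

$$T(n) = 7\,T\!\left(\frac{n}{2}\right) + \Theta(n^2) \quad\Longrightarrow\quad T(n) = \Theta\!\left(n^{\log_2 7}\right) \approx \Theta\!\left(n^{2.807}\right).$$

For the two-level case, the multiplication count on $(n/4)\times(n/4)$ blocks drops from $8^2 = 64$ to $7^2 = 49$, a ratio of

$$\frac{49}{64} \approx 0.766.$$

The extra additions are the $\Theta(n^2)$ term, which is why the advantage only shows up once the additions are cheap relative to the multiplications saved.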

Paper Structure (18 sections, 3 equations, 6 figures, 1 table)

Introduction
Background
Standard Matrix Multiplication
Anatomy of High-Performance FPGA-based GeMM
Strassen's Matrix Multiplication
Related Works
Proposed Implementation
Input Buffering/Reuse
Computing LHS and RHS
Computing $m_0, ..., m_{48}$
Output Buffers
The Outer Loops
OpenCL Host
Experiments and Results
Results and Comparison
...and 3 more sections

Figures (6)

Figure 1: Simplified illustration of the L1 GeMM module implemented by Vitis BLAS. The module takes $A^T$ and $B$ as inputs and utilizes shift registers and a systolic array to compute GeMM. The dotted vertical lines separate the different states of the registers as data flows through them. $t$ represents clock cycles.
Figure 2: Illustration of the L2 GeMM module implemented by Vitis BLAS. Pipes represent FIFO streams. The read module reads submatrices from external memory. Output of L1 GeMM has to be accumulated for computing the dot product between the rows and columns of $A$ and $B$. After the accumulation, the buffer contents are written to the external memory.
Figure 3: (a) Standard GeMM and (b) Strassen's matrix multiplication algorithm for $2\times 2$ submatrices, and (c) Strassen's squared algorithm for $4\times 4$ submatrices. Light blue and orange mark the left-hand side (LHS) and right-hand side (RHS) operands, respectively, of the intermediate computations: $m_0$ to $m_7$ for standard GeMM and $m_0$ to $m_6$ for Strassen's algorithm in the $2\times 2$ case, and $m_0$ to $m_{48}$ in the $4\times 4$ case. Green marks the accumulation of intermediate results into the output submatrices.
Figure 4: Illustration of the Strassen$^2$ kernel. The major differences compared to the Vitis BLAS L2 GeMM (Fig. 2) are in Read/Buffer, computation of LHS/RHS, and the output buffering.
Figure 5: Comparison of the measured performance of the Strassen$^2$ kernel against Vitis BLAS GeMM for different data types and matrix sizes, on (a) Alveo U50 and (b) Alveo U280 FPGAs.
...and 1 more figures
