Strassen Multisystolic Array Hardware Architectures

Trevor E. Pogue, Nicola Nicolici

TL;DR

The paper addresses the gap between the theoretical multiplication savings of Strassen's algorithm and practical hardware acceleration by proposing Strassen-based multisystolic array architectures implemented on FPGA. By pipelining the data movement between recursion levels and using memory layouts that generate intermediate blocks on the fly, the designs turn the algorithm's complexity reduction into hardware resource savings: DSP requirements fall by a factor of $1.14^r$ for $r$ implemented recursion levels (about 1.14 per level), while soft logic usage stays broadly similar. Because the designs are multisystolic, they can multiply smaller matrices with higher utilization than single-systolic array designs. The authors evaluate the architectures both in isolation and inside an end-to-end deep learning accelerator, reporting state-of-the-art results on FPGA. The work shows a practical route to realizing Strassen's savings in GEMM-focused hardware, with the DSP benefit growing as recursion depth increases.
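The $1.14^r$ figure is consistent with the well-known property that one level of Strassen recursion replaces 8 block multiplications with 7. As a worked sketch, under the assumption (not stated explicitly in this summary) that DSP usage scales with the number of multipliers, the reduction after $r$ levels would be:

```latex
\frac{\text{multiplications (conventional)}}{\text{multiplications (Strassen)}}
  = \frac{8^{r}}{7^{r}}
  = \left(\frac{8}{7}\right)^{r}
  \approx 1.14^{r},
\qquad
r = 1:\ \tfrac{8}{7} \approx 1.14,
\quad
r = 2:\ \tfrac{64}{49} \approx 1.31 .
```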

Abstract

While Strassen's matrix multiplication algorithm reduces the complexity of naive matrix multiplication, general-purpose hardware is not suitable for achieving the algorithm's promised theoretical speedups. This leaves the question of whether it could be better exploited in custom hardware architectures designed specifically for executing the algorithm. However, there is limited prior work on this, and it is not immediately clear how to derive such architectures or whether they can ultimately lead to real improvements. We bridge this gap, presenting and evaluating new systolic array architectures that efficiently translate the theoretical complexity reductions of Strassen's algorithm directly into hardware resource savings. Furthermore, the architectures are multisystolic array designs that can multiply smaller matrices with higher utilization than single-systolic array designs. The proposed designs implemented on FPGA reduce DSP requirements by a factor of $1.14^r$ for $r$ implemented Strassen recursion levels, and otherwise require overall similar soft logic resources when instantiated to support matrix sizes down to $32\times32$ and $24\times24$ at 1 and 2 levels of Strassen recursion, respectively. We evaluate the proposed designs both in isolation and in an end-to-end machine learning accelerator compared to baseline designs and prior works, achieving state-of-the-art performance.
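To make the complexity reduction concrete, the following is a minimal software sketch of Strassen's algorithm applied for a fixed number of recursion levels, falling back to conventional multiplication at the lowest level. It is not the authors' hardware design; the function name `strassen` and the `counter` argument are hypothetical, introduced only to count scalar multiplications and compare them with the conventional $n^3$.

```python
import numpy as np


def strassen(a, b, levels, counter):
    """Multiply square matrices a and b using `levels` Strassen recursion levels.

    `counter` is a one-element list that accumulates the number of scalar
    multiplications performed by the conventional multiplies at the lowest
    level (hypothetical bookkeeping, for illustration only).
    The matrix size must be divisible by 2**levels.
    """
    n = a.shape[0]
    if levels == 0:
        counter[0] += n * n * n  # conventional multiply: n^3 scalar products
        return a @ b

    h = n // 2
    a11, a12, a21, a22 = a[:h, :h], a[:h, h:], a[h:, :h], a[h:, h:]
    b11, b12, b21, b22 = b[:h, :h], b[:h, h:], b[h:, :h], b[h:, h:]

    def rec(x, y):
        return strassen(x, y, levels - 1, counter)

    # Seven block products instead of the conventional eight.
    m1 = rec(a11 + a22, b11 + b22)
    m2 = rec(a21 + a22, b11)
    m3 = rec(a11, b12 - b22)
    m4 = rec(a22, b21 - b11)
    m5 = rec(a11 + a12, b22)
    m6 = rec(a21 - a11, b11 + b12)
    m7 = rec(a12 - a22, b21 + b22)

    # Recombine the products into the four output blocks.
    c11 = m1 + m4 - m5 + m7
    c12 = m3 + m5
    c21 = m2 + m4
    c22 = m1 - m2 + m3 + m6
    return np.block([[c11, c12], [c21, c22]])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a = rng.integers(-5, 5, size=(8, 8))
    b = rng.integers(-5, 5, size=(8, 8))
    for r in range(3):
        count = [0]
        c = strassen(a, b, r, count)
        print(r, count[0], np.array_equal(c, a @ b))
```

For an $8\times8$ input, the multiplication count drops by a factor of $8/7$ with each added level, which matches the per-level factor of about 1.14 quoted for the DSP savings.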

Paper Structure

This paper contains 21 sections, 13 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: Example data layout for the $\mathbf{A}$ matrix in memory for an architecture implementing Strassen matrix multiplication for 2 levels of recursion (SMM$_2$). Each address $i$ contains every $m^{th}$ row of $\mathbf{A}$ concatenated together starting at row $i$ (notated as $\mathbf{A}_{i:m:,:}$). To help illustrate this, the gray coloured rows are all elements of $\mathbf{A}$ belonging to address 0, which forms $\mathbf{A}_{0:m:,:}$ containing row 0 of every $\mathbf{A}$ sub-block from the lowest level of recursion in the Strassen equations. The organization of the $\mathbf{B}$ matrices in memory is the same, except that the order of the elements is transposed compared to the $\mathbf{A}$ matrix layout shown here.
  • Figure 2: Top-level diagram of the proposed SMM multisystolic array architecture for implementing Strassen matrix multiplication for $r$ levels of recursion in hardware.
  • Figure 3: Internal structure of the SMM MXU addition vectors from the top-level SMM MXU diagram.
  • Figure 4: Baseline MM single-systolic array architecture that implements conventional matrix multiplication in hardware, provided for completeness and clarity. It is instantiated at the lowest level of recursion in the SMM and MM MXU architectures. $X$ here represents the width of the $a$ and $b$ vectors entering the MM MXU, and $Y$ represents the width of the $c$ vectors exiting the MXU.
  • Figure 5: Baseline MM$_r$ multisystolic array architecture for implementing conventional blocked matrix multiplication for $r$ levels of recursion in hardware.
  • ...and 2 more figures
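The memory layout described in the Figure 1 caption can be sketched in software with strided slicing. This is only an illustration of the notation $\mathbf{A}_{i:m:,:}$, read here as "rows $i, i+m, i+2m, \dots$ of $\mathbf{A}$, all columns"; it assumes that $m$ is the row count of a lowest-level sub-block, and the names `pack_rows` and `sub_block_row` are hypothetical.

```python
import numpy as np


def pack_rows(a, m):
    """Return the memory image of `a` as a 2-D array with one word per row.

    Word i holds A[i::m, :] flattened, i.e. every m-th row of A starting
    at row i, concatenated together. Assumes a.shape[0] is divisible by m.
    """
    return np.stack([a[i::m, :].reshape(-1) for i in range(m)])


def sub_block_row(words, i, p, q, n_cols, width):
    """Recover row i of sub-block (p, q) from memory word i.

    `n_cols` is the column count of A and `width` the sub-block width;
    both are illustration parameters, not names from the paper.
    """
    rows = words[i].reshape(-1, n_cols)  # one row of A per block row
    return rows[p, q * width:(q + 1) * width]


if __name__ == "__main__":
    # 8x8 matrix split into 4x4 sub-blocks of size 2x2 (2 recursion levels).
    a = np.arange(64).reshape(8, 8)
    words = pack_rows(a, m=2)
    print(words.shape)  # one word per sub-block row index
    print(sub_block_row(words, i=0, p=1, q=2, n_cols=8, width=2))
```

With this packing, a single read of address $i$ returns row $i$ of every sub-block at once, which is the property the caption highlights with the gray rows at address 0.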