Quadrilatero: A RISC-V programmable matrix coprocessor for low-power edge applications

Danilo Cammarata, Matteo Perotti, Marco Bertuletti, Angelo Garofalo, Pasquale Davide Schiavone, David Atienza, Luca Benini

TL;DR

Quadrilatero addresses the vector register file (VRF) bandwidth bottleneck of vector-based edge accelerators by introducing a RISC-V programmable matrix coprocessor with a dedicated matrix ISA and a systolic-array MAC engine. The design, implemented as a coprocessor for an RV32I core and evaluated in a $65$-nm process, occupies about $0.65\,\mathrm{mm}^2$ and reaches up to $99.4\%$ FPU utilization on $64\times64$ matmuls. Compared with a state-of-the-art vector processor and a hybrid vector-matrix processor, the open-source design improves area efficiency by up to $77\%$ and energy efficiency by up to $15\%$ across multiple configurations, supporting a matrix ISA as a practical approach for high-arithmetic-density edge AI workloads.

Abstract

The rapid growth of AI-based Internet-of-Things applications increased the demand for high-performance edge processing engines on a low-power budget and tight area constraints. As a consequence, vector processor architectures, traditionally designed for high-performance computing (HPC), made their way into edge devices, promising high utilization of floating-point units (FPUs) and low power consumption. However, vector processors can only exploit a single dimension of parallelism, leading to expensive accesses to the vector register file (VRF) when performing matrix computations, which are pervasive in AI workloads. To overcome these limitations while guaranteeing programmability, many researchers and companies are developing dedicated instructions for a more efficient matrix multiplication (MatMul) execution. In this context, we propose Quadrilatero, an open-source RISC-V programmable systolic array coprocessor for low-power edge applications that implements a streamlined matrix ISA extension. We evaluate the post-synthesis power, performance, and area (PPA) metrics of Quadrilatero in a mature 65-nm technology node, showing that it requires only 0.65 mm^2 and that it can reach up to 99.4% of FPU utilization. Compared to a state-of-the-art open-source RISC-V vector processor and a hybrid vector-matrix processor optimized for embedded applications, Quadrilatero improves area efficiency and energy efficiency by up to 77% and 15%, respectively.

Paper Structure

This paper contains 5 sections, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Pseudocode of an 8x8-based matmul with matrix instructions and its graphical representation.
  • Figure 2: Quadrilatero Architecture with RLEN = 128.
  • Figure 3: Gantt Chart of the intermediate loop of the matmul kernel executed by Quadrilatero. The inner loop is executed without stalls, while two consecutive intermediate loop iterations have only three cycles of losses on the memory port.
  • Figure 4: On the left, the system where we integrate Quadrilatero, Spatz with 4 FPUs and Spatz MX. On the right, the system where we integrate Spatz with 16 FPUs.
  • Figure 5: Experimental results on the comparison of the register file and FPU of the different systems.
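
The tile-based matmul that Figure 1's caption refers to can be illustrated with a minimal Python sketch. This is not the paper's pseudocode: the tile edge of 8 is an assumption taken from the "8x8-based" wording, and `tile_mac`, `load_tile`, `store_tile`, and `tiled_matmul` are hypothetical names standing in for matrix multiply-accumulate, load, and store instructions whose actual semantics are not given here. The point is only the loop structure: each output tile stays in an accumulator while the reduction dimension is walked tile by tile.

```python
TILE = 8  # assumed tile edge, inferred from the "8x8-based" caption wording


def tile_mac(acc, a_tile, b_tile):
    """Hypothetical stand-in for one matrix MAC instruction: acc += a_tile @ b_tile."""
    for i in range(TILE):
        for j in range(TILE):
            s = acc[i][j]
            for k in range(TILE):
                s += a_tile[i][k] * b_tile[k][j]
            acc[i][j] = s


def load_tile(m, row, col):
    """Hypothetical tile load: copy a TILE x TILE block out of matrix m."""
    return [r[col:col + TILE] for r in m[row:row + TILE]]


def store_tile(m, row, col, tile):
    """Hypothetical tile store: write a TILE x TILE block back into matrix m."""
    for i in range(TILE):
        m[row + i][col:col + TILE] = tile[i]


def tiled_matmul(a, b):
    """Compute a @ b for dimensions that are multiples of TILE."""
    n, kdim, p = len(a), len(b), len(b[0])
    assert n % TILE == 0 and kdim % TILE == 0 and p % TILE == 0
    c = [[0.0] * p for _ in range(n)]
    for i0 in range(0, n, TILE):          # outer loop: output tile rows
        for j0 in range(0, p, TILE):      # intermediate loop: output tile columns
            acc = [[0.0] * TILE for _ in range(TILE)]
            for k0 in range(0, kdim, TILE):  # inner loop: reduction dimension
                tile_mac(acc, load_tile(a, i0, k0), load_tile(b, k0, j0))
            store_tile(c, i0, j0, acc)    # the accumulator is written back once per tile
    return c


def naive_matmul(a, b):
    """Reference triple-loop matmul used only to check the tiled version."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]
```

In this sketch each accumulator tile receives `kdim / TILE` multiply-accumulate updates before a single store, which is the kind of operand reuse a matrix ISA is meant to expose; how Quadrilatero maps this onto its registers and systolic array is described in the paper itself, not here.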