Spatz: Clustering Compact RISC-V-Based Vector Units to Maximize Computing Efficiency

Matteo Perotti; Samuel Riedel; Matheus Cavalcante; Luca Benini

TL;DR

Spatz addresses the instruction-side and data-side memory bottlenecks of conventional processor-based architectures with a compact, open-source vector processing unit based on the RISC-V Zve64d vector extension and integrated into a shared-L1 cluster. The design pairs a lean, latch-based VRF with a dual-core Spatz cluster to maximize data reuse, guided by an analytic energy model that balances L0 capacity against L1 bandwidth. Key contributions include a detailed hardware design (including ExSdotp), a scalable controller and VRF, and implementation results showing high energy efficiency (95.7 DP-GFLOPS/W on matrix multiplication and up to 99.3 DP-GFLOPS/W on a 2D workload with a 7x7 kernel) and strong area efficiency relative to scalar or SSR-based baselines. The findings indicate that small, tightly coupled vector clusters can deliver competitive performance with substantially reduced memory traffic, making vector processing viable for energy-constrained edge and embedded contexts.

Abstract

The ever-increasing computational and storage requirements of modern applications and the slowdown of technology scaling pose major challenges to designing and implementing efficient computer architectures. To mitigate the bottlenecks of typical processor-based architectures on both the instruction and data sides of the memory, we present Spatz, a compact 64-bit floating-point-capable vector processor based on RISC-V's Vector Extension Zve64d. Using Spatz as the main Processing Element (PE), we design an open-source dual-core vector processor architecture based on a modular and scalable cluster sharing a Scratchpad Memory (SPM). Unlike typical vector processors, whose Vector Register Files (VRFs) are hundreds of KiB large, we prove that Spatz can achieve peak energy efficiency with a latch-based VRF of only 2 KiB. An implementation of the Spatz-based cluster in GlobalFoundries' 12LPP process with eight double-precision Floating Point Units (FPUs) achieves an FPU utilization just 3.4% lower than the ideal upper bound on a double-precision, floating-point matrix multiplication. The cluster reaches 7.7 FMA/cycle, corresponding to 15.7 DP-GFLOPS and 95.7 DP-GFLOPS/W at 1 GHz and nominal operating conditions (TT, 0.80 V, 25 °C), with more than 55% of the power spent on the FPUs. Furthermore, the optimally balanced Spatz-based cluster reaches a 95.0% FPU utilization (7.6 FMA/cycle), 15.2 DP-GFLOPS, and 99.3 DP-GFLOPS/W (61% of the power spent in the FPUs) on a 2D workload with a 7x7 kernel, resulting in an outstanding area/energy efficiency of 171 DP-GFLOPS/W/mm². At equi-area, the computing cluster built upon compact vector processors reaches a 30% higher energy efficiency than a cluster with the same FPU count built upon scalar cores specialized for stream-based floating-point computation.
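The figures quoted for the 2D workload can be related to one another with simple arithmetic. The sketch below is illustrative only: it assumes one FMA counts as two floating-point operations and that each of the eight FPUs retires at most one FMA per cycle; the function names are hypothetical, and the implied power is a derived estimate, not a number reported in the abstract.

```python
# Relating FPU utilization, FMA/cycle, and DP-GFLOPS for the 2D workload.
# Assumptions (not stated in the abstract): 1 FMA = 2 FLOP, and each FPU
# issues at most one FMA per cycle.

NUM_FPUS = 8          # eight double-precision FPUs in the cluster
FREQ_GHZ = 1.0        # 1 GHz clock
FLOP_PER_FMA = 2      # assumed: multiply + add


def fma_per_cycle(num_fpus: int, utilization: float) -> float:
    """Sustained FMA operations per cycle at a given FPU utilization."""
    return num_fpus * utilization


def dp_gflops(fma_rate: float, freq_ghz: float) -> float:
    """Throughput in DP-GFLOPS for a sustained FMA rate and clock frequency."""
    return fma_rate * FLOP_PER_FMA * freq_ghz


fma_2d = fma_per_cycle(NUM_FPUS, 0.95)    # utilization quoted as 95.0%
perf_2d = dp_gflops(fma_2d, FREQ_GHZ)     # compare with the quoted 15.2
implied_power_w = perf_2d / 99.3          # derived from 99.3 DP-GFLOPS/W

print(f"{fma_2d:.1f} FMA/cycle, {perf_2d:.1f} DP-GFLOPS, "
      f"~{implied_power_w * 1e3:.0f} mW implied")
```

Under these assumptions, the 95.0% utilization reproduces the quoted 7.6 FMA/cycle and 15.2 DP-GFLOPS for the 2D workload.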

Paper Structure (27 sections, 9 equations, 15 figures, 4 tables)

Introduction
Vector Register File
Matching workload and VRF for optimal efficiency
Energy Consumption Model
FPUs
PEs
L0 SCM
L1 SPM
Energy Efficiency Optimization
Spatz: A Compact Vector Processing Unit
Instruction Dispatch
Controller
Vector Register File
Functional Units
Vector Arithmetic Unit
...and 12 more sections

Figures (15)

Figure 1: A shared-L1 cluster design with $C$ PEs, each controlling $F$ FPUs, and a multi-banked L1 Scratchpad Memory (SPM) with $M$ banks.
Figure 2: Architecture of a latch-based SCM with $R$ rows, each $W$ bytes wide, for a total capacity $K = WR$ bytes.
Figure 3: Energy consumption of a latch-based SCM with $R$ rows of width $W$ and capacity $K = WR$ bytes. The dashed lines correspond to the functions in Eqs. (4) and (5).
Figure 4: Breakdown of the energy consumption per cycle of the shared-L1 cluster, as a function of its vector length $\mathtt{VLENB}$.
Figure 5: Energy efficiency $\Phi$ of the cluster executing a $256 \times 256$ matrix multiplication kernel, as a function of the vector length $\mathtt{VLENB}$.
...and 10 more figures
