A "New Ara" for Vector Computing: An Open Source Highly Efficient RISC-V V 1.0 Vector Processor Design
Matteo Perotti, Matheus Cavalcante, Nils Wistoff, Renzo Andri, Lukas Cavigelli, Luca Benini
TL;DR
This paper presents the first open-source implementation of the RVV 1.0 vector extension, tightly integrated with the CVA6 scalar core, enabling open exploration of lane-based vector architectures. It analyzes the architectural changes relative to earlier RVV versions, including the vector register file (VRF) as global state, the SLEN=VLEN register layout, and monomorphic per-type encoding, and details the hardware mechanisms (Mask Unit, reshuffle) needed for correct and efficient operation. Implemented in GlobalFoundries 22FDX, the new Ara design is competitive in performance and energy efficiency: it reaches FPU utilization above 98.5% on key kernels with long vectors and delivers up to 37.1 DP-GFLOPS/W, while using 15% less area than the RVV 0.5 baseline. By releasing the hardware and software stack publicly, the work enables broader open-source vector computing research and potential standardization of RVV 1.0 implementations.
Abstract
Vector architectures are gaining traction for highly efficient processing of data-parallel workloads, driven by all major ISAs (RISC-V, Arm, Intel), and boosted by landmark chips, like the Arm SVE-based Fujitsu A64FX, powering the TOP500 leader Fugaku. The RISC-V V extension has recently reached 1.0-Frozen status. Here, we present its first open-source implementation, discuss the new specification's impact on the micro-architecture of a lane-based design, and provide insights on performance-oriented design of coupled scalar-vector processors. Our system achieves comparable or better power, performance, and area (PPA) than state-of-the-art vector engines that implement older RVV versions: 15% better area, 6% improved throughput, and FPU utilization >98.5% on crucial kernels.
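To make the FPU utilization figure concrete, the following minimal Python sketch shows one common way such a metric can be computed: achieved FLOP divided by the peak FLOP the lanes could deliver in the same number of cycles. It is an illustration under stated assumptions, not the authors' measurement flow. The function names, the assumption that each lane completes one fused multiply-add (2 FLOP) per cycle, and the lane and cycle counts are all hypothetical.

```python
def peak_flop_per_cycle(lanes: int, flop_per_lane_per_cycle: int = 2) -> int:
    """Peak FLOP per cycle.

    Assumption: each lane retires one fused multiply-add per cycle,
    counted as 2 FLOP. Adjust the default for other FPU configurations.
    """
    return lanes * flop_per_lane_per_cycle


def fpu_utilization(flop: int, cycles: int, lanes: int) -> float:
    """Fraction of peak FPU throughput achieved over a kernel run."""
    if cycles <= 0 or lanes <= 0:
        raise ValueError("cycles and lanes must be positive")
    return flop / (cycles * peak_flop_per_cycle(lanes))


def matmul_flop(n: int) -> int:
    """FLOP count of an n x n x n matrix multiplication.

    It performs n^3 multiply-adds, each counted as 2 FLOP.
    """
    return 2 * n ** 3


if __name__ == "__main__":
    # Hypothetical inputs chosen for illustration only;
    # these are not measurements from the paper.
    n, lanes, cycles = 128, 4, 532_000
    utilization = fpu_utilization(matmul_flop(n), cycles, lanes)
    print(f"FPU utilization: {utilization:.3f}")
```

Under this definition, utilization approaches 1.0 only when the cycle count approaches the ideal `flop / peak_flop_per_cycle(lanes)`. Fixed per-instruction overheads therefore weigh less as vectors get longer.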
