A "New Ara" for Vector Computing: An Open Source Highly Efficient RISC-V V 1.0 Vector Processor Design

Matteo Perotti, Matheus Cavalcante, Nils Wistoff, Renzo Andri, Lukas Cavigelli, Luca Benini

TL;DR

This paper presents the first open-source implementation of the RVV 1.0 vector extension, tightly integrated with the CVA6 scalar core, enabling open exploration of lane-based vector architectures. It analyzes the architectural shifts from earlier RVV versions, including VRF global state, the SLEN=VLEN stripe layout, and monomorphic per-type encoding, and details the hardware mechanisms (Mask Unit, reshuffle) needed for correct and efficient operation. Implemented in GlobalFoundries 22FDX technology, the new Ara design reaches peak FPU utilization above 98% on long vectors and delivers up to 37.1 DP-GFLOPS/W, while reducing die area by over 15% compared to the baseline built on RVV 0.5. By releasing the hardware and software stack publicly, the work enables broader open-source vector computing research and potential standardization of RVV 1.0 implementations.

Abstract

Vector architectures are gaining traction for highly efficient processing of data-parallel workloads, driven by all major ISAs (RISC-V, Arm, Intel), and boosted by landmark chips, like the Arm SVE-based Fujitsu A64FX, powering the TOP500 leader Fugaku. The RISC-V V extension has recently reached 1.0-Frozen status. Here, we present its first open-source implementation, discuss the new specification's impact on the micro-architecture of a lane-based design, and provide insights on performance-oriented design of coupled scalar-vector processors. Our system achieves comparable/better PPA than state-of-the-art vector engines that implement older RVV versions: 15% better area, 6% improved throughput, and FPU utilization >98.5% on crucial kernels.
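
The FPU utilization figure quoted above can be read as achieved floating-point throughput divided by the peak the lanes could sustain. The sketch below illustrates that ratio for an $n \times n$ matrix multiplication. It is not taken from the paper: it assumes one double-precision FMA unit per lane that retires one fused multiply-add (2 FLOP) per cycle, and the function names and the cycle count in the usage lines are hypothetical.

```python
def peak_flop_per_cycle(lanes: int, flop_per_fma: int = 2) -> int:
    """Peak DP-FLOP per cycle.

    Assumption (not from the source): each lane holds one FMA unit
    that completes one fused multiply-add, i.e. 2 FLOP, every cycle.
    """
    return lanes * flop_per_fma


def matmul_flop(n: int) -> int:
    """FLOP count of an n x n matrix multiplication.

    There are n*n outputs, each needing n multiply-add pairs.
    """
    return 2 * n ** 3


def fpu_utilization(n: int, lanes: int, cycles: int) -> float:
    """Fraction of peak FPU throughput achieved over `cycles` cycles."""
    return matmul_flop(n) / (cycles * peak_flop_per_cycle(lanes))


if __name__ == "__main__":
    n, lanes = 128, 4
    # Lower bound on runtime if every FMA slot is used.
    ideal_cycles = matmul_flop(n) // peak_flop_per_cycle(lanes)
    # Illustrative cycle count only, not a measurement from the paper.
    example_cycles = ideal_cycles + 8000
    print(f"ideal cycles: {ideal_cycles}")
    print(f"utilization:  {fpu_utilization(n, lanes, example_cycles):.3f}")
```

Under these assumptions, utilization equals 1.0 only when the runtime matches the ideal cycle count; any dispatch or memory stall lowers the ratio.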

Paper Structure

This paper contains 27 sections, 2 equations, 7 figures, and 3 tables.

Figures (7)

  • Figure 1: Top-level block diagram of the new Ara system, with the vector co-processor marked in green, a more detailed diagram of the lane in magenta, and the host scalar core CVA6 in blue.
  • Figure 2: Runtime of the matrix multiplication kernel of size $n \times n$ on our CVA6+Vector Unit system ($\blacksquare$), compared with the ideal dispatcher ($\square$), for several numbers of lanes $\ell$.
  • Figure 3: System throughput ideality, relative to a system with an ideal dispatcher, as a function of CVA6's D-cache line size and data width.
  • Figure 4: Physical implementation of the full system. The lane is implemented and enclosed in a macro and then placed on the die. The system input and output are at the top of the die (AXI interface).
  • Figure 5: Physical implementation of a Lane. Modules without a label: lane sequencer, operand requesters (close to the ), and control logic for and (in the middle).
  • ...and 2 more figures