Ara2: Exploring Single- and Multi-Core Vector Processing with an Efficient RVV 1.0 Compliant Open-Source Processor

Matteo Perotti, Matheus Cavalcante, Renzo Andri, Lukas Cavigelli, Luca Benini

TL;DR

Ara2 addresses the need for an open, RVV 1.0-compliant vector processor with an open-source design that scales from 2 to 16 lanes, is implemented in a 22nm technology, and integrates modularly with its scalar core. The paper evaluates performance and energy efficiency across a diverse kernel set, reports that average throughput ideality sits around 50% at moderate vector sizes, and shows that a multi-core configuration can deliver more than 3x the performance of a single core with the same number of FPUs on a 32x32x32 matrix multiplication. It highlights architectural decisions, such as a dispatcher-based decoding path and coherence mechanisms, and provides detailed physical metrics, including a clock frequency of up to 1.35 GHz and an energy efficiency of 37.8 DP-GFLOPS/W at 0.8V. Overall, Ara2 establishes a concrete, open benchmark for RVV 1.0 vector processing, offers insights into scalability and bottlenecks, and compares favorably with state-of-the-art designs while underscoring the trade-offs between single-core and multi-core vector architectures.

Abstract

Vector processing is highly effective in boosting processor performance and efficiency for data-parallel workloads. In this paper, we present Ara2, the first fully open-source vector processor to support the RISC-V V 1.0 frozen ISA. We evaluate Ara2's performance on a diverse set of data-parallel kernels for various problem sizes and vector-unit configurations, achieving an average functional-unit utilization of 95% on the most computationally intensive kernels. We pinpoint performance boosters and bottlenecks, including the scalar core, memories, and vector architecture, providing insights into the main vector architecture's performance drivers. Leveraging the openness of the design, we implement Ara2 in a 22nm technology, characterize its PPA metrics on various configurations (2-16 lanes), and analyze its microarchitecture and implementation bottlenecks. Ara2 achieves a state-of-the-art energy efficiency of 37.8 DP-GFLOPS/W (0.8V) and 1.35GHz of clock frequency (critical path: ~40 FO4 gates). Finally, we explore the performance and energy-efficiency trade-offs of multi-core vector processors: we find that multiple vector cores help overcome the scalar core issue-rate bound that limits short-vector performance. For example, a cluster of eight 2-lane Ara2 (16 FPUs) achieves more than 3x better performance than a 16-lane single-core Ara2 (16 FPUs) when executing a 32x32x32 matrix multiplication, with 1.5x improved energy efficiency.
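
The issue-rate argument in the abstract can be illustrated with a toy model. This is only a sketch under stated assumptions, not the paper's performance model: the function names and the issue interval of 4 cycles are hypothetical, and the model assumes that a vector instruction keeps the lanes busy for ceil(VL / lanes) cycles while the scalar core dispatches one vector instruction every fixed number of cycles.

```python
def fpu_utilization(vl_elems: int, lanes: int, issue_interval: int) -> float:
    """Toy estimate of the fraction of cycles in which the FPUs are busy.

    Assumption: one vector instruction occupies the lanes for
    ceil(vl_elems / lanes) cycles, and the scalar core can dispatch
    one vector instruction every `issue_interval` cycles (hypothetical).
    """
    busy_cycles = -(-vl_elems // lanes)  # ceiling division
    return min(1.0, busy_cycles / issue_interval)


def cluster_throughput(cores: int, lanes_per_core: int,
                       vl_elems: int, issue_interval: int) -> float:
    """Aggregate busy FPUs per cycle, assuming each core has its own
    scalar core and the cores do not interfere with each other."""
    return cores * lanes_per_core * fpu_utilization(
        vl_elems, lanes_per_core, issue_interval)


# 32-element vectors, as in a 32x32x32 matrix multiplication.
single = cluster_throughput(cores=1, lanes_per_core=16, vl_elems=32, issue_interval=4)
cluster = cluster_throughput(cores=8, lanes_per_core=2, vl_elems=32, issue_interval=4)
print(single, cluster)  # 8.0 16.0
```

With these illustrative numbers, the 16-lane core finishes each instruction in 2 cycles and then waits for the next one, while each 2-lane core stays busy for 16 cycles per instruction. The toy model reproduces the direction of the effect only; the measured speedup and energy figures are those reported in the abstract.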

Paper Structure (20 sections, 20 figures, 5 tables)

Figures (20)

  • Figure 1: Top-level block diagram of the Ara2 system with the vector co-processor (green), a more detailed diagram of the lane (magenta), and the host scalar core CVA6 (blue).
  • Figure 2: Comparison between the baseline and the optimized slide units, with a focus on how an arbitrary output byte (the 15th) is mapped to the input bytes. The baseline slide unit supports arbitrary slide amounts and can slide and re-encode a vector in the same cycle. Each output byte is mapped to every input byte, so the total number of connections is $O(L^2)$. The optimized design supports only power-of-two slides, and slides and re-encoding cannot happen at the same time. Its total number of connections grows as $O(L \log_2 L)$.
  • Figure 3: Number of 2-to-1 multiplexers needed to implement a slide unit as a function of the number of lanes, for different configurations. An all-to-all slide unit connects every input byte to every output byte and supports every slide and reshuffle operation in one cycle. The other configurations support only slides by powers of two (SlideP2) or slides by one (Slide1), and either slide and reshuffle together or perform only one of the two at a time.
  • Figure 4: Correlation between raw throughput ideality and byte/lane ratio for dotproduct (left) and fmatmul (right). The raw throughput ideality tends to be similar when the number of elements per lane is the same (on the diagonals).
  • Figure 5: Performance evaluation of the system with different configurations and number of lanes on different kernels and vector lengths.
  • ...and 15 more figures
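
The scaling contrast described in the captions of Figures 2 and 3 can be sketched with a simplified multiplexer count. This is an illustrative estimate, not the data plotted in Figure 3. It assumes a 64-bit datapath per lane (`BYTES_PER_LANE = 8`), a power-of-two number of lanes, one slide direction, and that an n-input selector costs n - 1 two-to-one multiplexers. The function names are hypothetical.

```python
BYTES_PER_LANE = 8  # assumption: 64-bit lane datapath


def mux2_all_to_all(lanes: int) -> int:
    """Every output byte selects among all n input bytes: n * (n - 1) muxes."""
    n = lanes * BYTES_PER_LANE
    return n * (n - 1)


def mux2_power_of_two(lanes: int) -> int:
    """Every output byte selects among the unslid byte and the bytes at
    offsets 1, 2, 4, ..., n/2, i.e. log2(n) + 1 inputs per output."""
    n = lanes * BYTES_PER_LANE
    log2_n = n.bit_length() - 1  # exact for power-of-two n
    return n * log2_n


for lanes in (2, 4, 8, 16):
    print(lanes, mux2_all_to_all(lanes), mux2_power_of_two(lanes))
```

Under these assumptions, the all-to-all count grows quadratically with the datapath width, while the power-of-two count grows as n log2(n), which matches the asymptotic behavior stated in the Figure 2 caption.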