AraOS: Analyzing the Impact of Virtual Memory Management on Vector Unit Performance

Matteo Perotti, Vincenzo Maisto, Moritz Imfeld, Nils Wistoff, Alessandro Cilardo, Luca Benini

TL;DR

This work examines how virtual memory management affects vector-unit performance in open-source RVV hardware. It introduces AraOS, which couples the Ara2 vector processor to the CVA6 MMU within the Cheshire platform, and evaluates it on Linux with matrix-multiplication kernels and the RiVEC benchmark suite. Virtual-memory overhead stays below 3.5% with at least 16 TLB entries, and a two-lane AraOS achieves an average speedup of 3.2x with up to 39% greater area efficiency compared to a scalar baseline. The results position AraOS as a practical open reference design for vector processing with virtual-memory support and provide guidance for future vector-processor optimizations in OS environments.

Abstract

Vector processor architectures offer an efficient solution for accelerating data-parallel workloads (e.g., ML, AI), reducing instruction count, and enhancing processing efficiency. This is evidenced by the increasing adoption of vector ISAs, such as Arm's SVE/SVE2 and RISC-V's RVV, not only in high-performance computers but also in embedded systems. The open-source nature of RVV has particularly encouraged the development of numerous vector processor designs across industry and academia. However, despite the growing number of open-source RVV processors, there is a lack of published data on their performance in a complex application environment hosted by a full-fledged operating system (Linux). In this work, we add OS support to the open-source bare-metal Ara2 vector processor (AraOS) by sharing the MMU of CVA6, the scalar core used for instruction dispatch to Ara2, and integrate AraOS into the open-source Cheshire SoC platform. We evaluate the performance overhead of virtual-to-physical address translation by benchmarking matrix multiplication kernels across several problem sizes and translation lookaside buffer (TLB) configurations in CVA6's shared MMU, providing insights into vector performance in a full-system environment with virtual memory. With at least 16 TLB entries, the virtual memory overhead remains below 3.5%. Finally, we benchmark a 2-lane AraOS instance with the open-source RiVEC benchmark suite for RVV architectures, with peak average speedups of 3.2x against scalar-only execution.
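
To make the TLB-size dependence described above more concrete, the following is a minimal sketch, not the authors' methodology. It models a fully associative TLB with LRU replacement (an assumed policy, not confirmed for CVA6 here), 4 KiB pages, and the address stream of a naive scalar row-major matrix multiplication rather than the vectorized kernels used in the paper. All function names, base addresses, and cycle counts are hypothetical and chosen only for illustration.

```python
from collections import OrderedDict

PAGE_SIZE = 4096  # bytes; assumed 4 KiB base pages


def count_tlb_misses(addresses, num_entries):
    """Count misses of a fully associative TLB with LRU replacement (assumed policy)."""
    tlb = OrderedDict()  # maps virtual page number -> True, ordered by recency
    misses = 0
    for addr in addresses:
        vpn = addr // PAGE_SIZE
        if vpn in tlb:
            tlb.move_to_end(vpn)  # mark as most recently used
        else:
            misses += 1
            tlb[vpn] = True
            if len(tlb) > num_entries:
                tlb.popitem(last=False)  # evict least recently used entry
    return misses


def matmul_addresses(n, elem_bytes=8,
                     base_a=0x1000_0000, base_b=0x2000_0000, base_c=0x3000_0000):
    """Yield byte addresses touched by a naive n x n matmul C = A * B (row-major).

    The base addresses are hypothetical, page-aligned placeholders.
    """
    for i in range(n):
        for j in range(n):
            for k in range(n):
                yield base_a + (i * n + k) * elem_bytes
                yield base_b + (k * n + j) * elem_bytes
            yield base_c + (i * n + j) * elem_bytes


def overhead_percent(cycles_with_vm, cycles_without_vm):
    """Relative slowdown, in percent, of a run with address translation enabled."""
    return 100.0 * (cycles_with_vm - cycles_without_vm) / cycles_without_vm


if __name__ == "__main__":
    for entries in (2, 4, 8, 16):
        print(entries, count_tlb_misses(matmul_addresses(32), entries))
    # Hypothetical cycle counts, used only to show the overhead formula.
    print(overhead_percent(1035, 1000))
```

In this toy model, a 32 x 32 matrix of 8-byte elements spans two pages per operand, so a TLB large enough to hold all six pages incurs only compulsory misses, while a very small TLB thrashes. The measured overheads reported in the abstract come from the actual hardware and are not reproduced by this sketch.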

Paper Structure

This paper contains 6 sections, 2 figures, 1 table.

Figures (2)

  • Figure 1: AraOS with shared MMU to enable virtual memory.
  • Figure 2: AraOS integration in the Cheshire SoC and performance overhead for a matrix multiplication on different problem sizes.