Evaluating IOMMU-Based Shared Virtual Addressing for RISC-V Embedded Heterogeneous SoCs

Cyril Koenig, Enrico Zelioli, Luca Benini

TL;DR

This paper evaluates IOMMU-based shared virtual addressing for RISC-V embedded heterogeneous SoCs by integrating an open-source Pinto IOMMU into a CVA6 host and Snitch accelerator platform and benchmarking with RajaPERF/OpenMP on FPGA-emulated memory hierarchies. A key finding is that IO virtual address translation accounts for $4.2\%$–$17.6\%$ of accelerator runtime for gemm at varying memory bandwidths, but this overhead drops to $0.4\%$–$0.7\%$ when a shared last-level cache (LLC) is present, making shared addressing viable. The study also shows that, without an LLC, overheads can reach $81.3\%$ (heat3d) or $17.6\%$ (gemm), while with an LLC the overhead remains below $2\%$ across kernels, underscoring the importance of memory hierarchy design. Overall, the work demonstrates that combining a shared LLC with IOMMU-based shared virtual addressing enables zero-copy offloading and practical data sharing in open-source RISC-V heterogeneous SoCs. It provides concrete guidance for system architects aiming to deploy shared virtual memory in energy-efficient embedded accelerators without bespoke IO coalescing or prefetching.

Abstract

Embedded heterogeneous systems-on-chip (SoCs) rely on domain-specific hardware accelerators to improve performance and energy efficiency. In particular, programmable multi-core accelerators feature a cluster of processing elements and tightly coupled scratchpad memories to balance performance, energy efficiency, and flexibility. In embedded systems running a general-purpose OS, accelerators access data via dedicated, physically addressed memory regions. This negatively impacts memory utilization and performance by requiring a copy from the virtual host address space to the physical accelerator address space. Input-Output Memory Management Units (IOMMUs) overcome this limitation by allowing devices and hosts to use a shared virtual paged address space. However, resolving IO virtual addresses can be particularly costly on high-latency memory systems, as it requires up to three sequential memory accesses on an IOTLB miss. In this work, we present a quantitative evaluation of shared virtual addressing in RISC-V heterogeneous embedded systems. We integrate an IOMMU in an open-source heterogeneous RISC-V SoC consisting of a 64-bit host with a 32-bit accelerator cluster. We evaluate the system performance by emulating the design on FPGA and implementing compute kernels from the RajaPERF benchmark suite using heterogeneous OpenMP programming. We measure the transfer and computation time on the host and accelerators for systems with different DRAM access latencies. We first show that IO virtual address translation can account for 4.2% up to 17.6% of the accelerator's runtime for gemm (General Matrix Multiplication) at low and high memory bandwidth. Then, we show that in systems containing a last-level cache, this IO address translation cost falls to 0.4% and 0.7% under the same conditions, making shared virtual addressing and zero-copy offloading suitable for such RISC-V heterogeneous SoCs.
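
To illustrate why an IOTLB miss can cost up to three sequential memory accesses, the following is a minimal sketch of a three-level page-table walk. It assumes an Sv39-style layout (4 KiB pages, 9-bit indices per level, 8-byte entries), which the abstract does not state explicitly. The names `Memory`, `map_page`, and `translate`, and all addresses used, are hypothetical and are not taken from the paper's hardware or software.

```python
PAGE_SHIFT = 12      # 4 KiB pages (assumption: Sv39-style layout)
VPN_BITS = 9         # 512 entries per page-table level
LEVELS = 3           # three levels, hence up to three reads per walk
PTE_V, PTE_R, PTE_W, PTE_X = 1, 2, 4, 8


class Memory:
    """Toy physical memory that counts page-table entry reads."""

    def __init__(self):
        self.words = {}
        self.reads = 0

    def read(self, addr):
        self.reads += 1
        return self.words.get(addr, 0)


def pte_addr(table_ppn, va, level):
    """Address of the entry that `va` selects in the table at `level`."""
    idx = (va >> (PAGE_SHIFT + VPN_BITS * level)) & 0x1FF
    return (table_ppn << PAGE_SHIFT) + idx * 8


def map_page(mem, root_ppn, va, ppn, alloc):
    """Install a 4 KiB mapping va -> ppn, creating intermediate tables."""
    table = root_ppn
    for level in (2, 1):
        addr = pte_addr(table, va, level)
        pte = mem.words.get(addr, 0)
        if not pte & PTE_V:
            pte = (alloc() << 10) | PTE_V      # pointer to the next level
            mem.words[addr] = pte
        table = pte >> 10
    mem.words[pte_addr(table, va, 0)] = (ppn << 10) | PTE_V | PTE_R | PTE_W


def translate(mem, root_ppn, va, iotlb):
    """Translate `va`; walk the page table only on an IOTLB miss."""
    vpn, offset = va >> PAGE_SHIFT, va & 0xFFF
    if vpn in iotlb:                            # hit: no memory access
        return (iotlb[vpn] << PAGE_SHIFT) | offset
    table = root_ppn
    for level in reversed(range(LEVELS)):       # levels 2, 1, 0 in sequence
        pte = mem.read(pte_addr(table, va, level))
        if not pte & PTE_V:
            raise LookupError("page fault")
        if pte & (PTE_R | PTE_X):               # leaf entry
            if level != 0:
                raise NotImplementedError("superpages omitted in this sketch")
            iotlb[vpn] = pte >> 10
            return ((pte >> 10) << PAGE_SHIFT) | offset
        table = pte >> 10                       # next read depends on this one
    raise LookupError("no leaf entry found")


# Usage with made-up page numbers.
mem = Memory()
free_pages = iter(range(0x100, 0x200))
root = next(free_pages)
map_page(mem, root, 0x4000_1000, 0x80123, lambda: next(free_pages))

iotlb = {}
pa_miss = translate(mem, root, 0x4000_1234, iotlb)
reads_after_miss = mem.reads
pa_hit = translate(mem, root, 0x4000_1238, iotlb)
reads_after_hit = mem.reads
```

In this sketch the first translation misses the IOTLB and issues three dependent reads, because each level's entry supplies the address of the next table. The second translation hits the cached entry and issues none. On a high-latency memory system each of those dependent reads pays the full access latency, which is consistent with the abstract's observation that a last-level cache sharply reduces translation cost.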

Paper Structure

This paper contains 12 sections, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Block diagram of the prototype platform. Note that the DRAM delayer is added in emulation to provide a more accurate performance evaluation.
  • Figure 2: (Left) axpy$_{32.768}$ breakdown for three scenarios: host-only execution; data copy and device execution; data mapping and device execution. (Right) Time spent copying or mapping data of different input sizes.
  • Figure 3: Data copying and mapping time with input size and different latencies.
  • Figure 4: Kernel execution for different latencies with three configurations: IOMMU disabled, IOMMU enabled with LLC disabled, and IOMMU enabled with LLC enabled.
  • Figure 5: Average IOMMU page table walk time with and without host interference for increasing latencies.