ARCANE: Adaptive RISC-V Cache Architecture for Near-memory Extensions

Vincenzo Petrolo, Flavia Guella, Michele Caon, Pasquale Davide Schiavone, Guido Masera, Maurizio Martina

TL;DR

ARCANE tackles the memory-wall bottleneck by embedding a programmable compute-capable cache that doubles as a near-memory coprocessor using a RISC-V cache controller and a software-defined in-cache ISA. It leverages NM-Carus vector-like units and the CV-X-IF bridge to offload complex matrix operations, enabling efficient execution directly on data residing in the LLC. The key contributions include a fully functional cache architecture with locking, hazard management, and a hierarchical software runtime (cos) that decodes, schedules, and allocates matrix kernels, plus an extendable xmnmc instruction set with Matrix Reserve and Matrix Kernels. Experimental results show substantial performance gains (up to $84\times$ on 8-bit CNN workloads) with moderate area overhead ($41.3\%$) and favorable scaling for large inputs, highlighting ARCANE’s potential for edge computing and energy-efficient data-intensive processing.

Abstract

Modern data-driven applications expose the limitations of von Neumann architectures: extensive data movement, low throughput, and poor energy efficiency. Accelerators improve performance but lack flexibility and require data transfers. Existing compute in- and near-memory solutions mitigate these issues but face usability challenges due to data placement constraints. We propose a novel cache architecture that doubles as a tightly-coupled compute-near-memory coprocessor. Our RISC-V cache controller executes custom instructions from the host CPU using vector operations dispatched to near-memory vector processing units within the cache memory subsystem. This architecture abstracts memory synchronization and data mapping from application software while offering software-based Instruction Set Architecture extensibility. Our implementation shows a $30\times$ to $84\times$ performance improvement when operating on 8-bit data over the same system with a traditional cache when executing a worst-case 32-bit CNN workload, with only $41.3\%$ area overhead.

Paper Structure

This paper contains 22 sections, 5 figures, 2 tables.

Figures (5)

  • Figure 1: X-HEEP system-level block diagram with a detailed view of the ARCANE LLC and software stack.
  • Figure 1: Example of ARCANE custom kernels.
  • Figure 2: Area split of X-HEEP + ARCANE 4-lane configuration (128 iB) versus X-HEEP + standard data LLC (128 iB).
  • Figure 3: Non-compute phases overhead analysis under different input matrix sizes and ARCANE lanes with int32 datatype.
  • Figure 4: Speedup comparison between single instance ARCANE configurations, CV32E40X and CV32E40PX featuring XCVPULP extensions.