ARCANE: Adaptive RISC-V Cache Architecture for Near-memory Extensions
Vincenzo Petrolo, Flavia Guella, Michele Caon, Pasquale Davide Schiavone, Guido Masera, Maurizio Martina
TL;DR
ARCANE tackles the memory-wall bottleneck with a programmable, compute-capable cache that doubles as a near-memory coprocessor, built around a RISC-V cache controller and a software-defined in-cache ISA. It uses NM-Carus vector-like units and the CV-X-IF bridge to offload complex matrix operations, so that they execute directly on data residing in the LLC. The key contributions are a fully functional cache architecture with locking and hazard management, a hierarchical software runtime (cos) that decodes, schedules, and allocates matrix kernels, and an extensible xmnmc instruction set with Matrix Reserve and Matrix Kernels. Experimental results show substantial performance gains (up to $84\times$ on 8-bit CNN workloads) with moderate area overhead ($41.3\%$) and favorable scaling for large inputs, highlighting ARCANE's potential for edge computing and energy-efficient data-intensive processing.
Abstract
Modern data-driven applications expose the limitations of von Neumann architectures: extensive data movement, low throughput, and poor energy efficiency. Accelerators improve performance but lack flexibility and require data transfers. Existing compute in-memory and near-memory solutions mitigate these issues but face usability challenges due to data placement constraints. We propose a novel cache architecture that doubles as a tightly-coupled compute-near-memory coprocessor. Our RISC-V cache controller executes custom instructions from the host CPU using vector operations dispatched to near-memory vector processing units within the cache memory subsystem. This architecture abstracts memory synchronization and data mapping from application software while offering software-based Instruction Set Architecture (ISA) extensibility. Compared with the same system using a traditional cache, our implementation achieves a $30\times$ to $84\times$ performance improvement when operating on 8-bit data while executing a worst-case 32-bit CNN workload, with only $41.3\%$ area overhead.
