Register Dispersion: Reducing the Footprint of the Vector Register File in Vector Engines of Low-Cost RISC-V CPUs
Vasileios Titopoulos, George Alexakis, Kosmas Alexandridis, Chrysostomos Nicopoulos, Giorgos Dimitrakopoulos
TL;DR
The paper addresses the main obstacle to adding vector processing to ultra-low-cost CPUs: the large hardware footprint of the Vector Register File (VRF). It proposes Register Dispersion, a cache-like compact VRF (cVRF) that holds only the most recently accessed architectural vector registers and disperses the rest into the memory hierarchy, while remaining compatible with the RISC-V Vector (RVV) ISA extension. Hardware evaluation reports area savings of ≈53% for the VRF and ≈23% for the combined CPU and vector unit (CPU+VPU), and average power savings of ≈10%, with minimal or no performance loss when a cVRF of eight 256-bit registers is used for common ML workloads. The approach makes energy-efficient vector acceleration practical at the edge, balancing cost, performance, and power for ML applications.
Abstract
The deployment of Machine Learning (ML) applications at the edge on resource-constrained devices has accentuated the need for efficient ML processing on low-cost processors. While traditional CPUs provide programming flexibility, their general-purpose architecture often lacks the throughput required for complex ML models. The augmentation of a RISC-V processor with a vector unit can provide substantial data-level parallelism. However, increasing the data-level parallelism supported by vector processing would make the Vector Register File (VRF) a major area consumer in ultra-low-cost processors, since 32 vector registers are required for RISC-V Vector ISA compliance. This work leverages the insight that many vectorized ML kernels require only a small number of active vector registers, and proposes the use of a physically smaller VRF that dynamically caches only the vector registers currently accessed by the application. This approach, called Register Dispersion, maps the architectural vector registers to a smaller set of physical registers. The proposed ISA-compliant VRF is significantly smaller than a full-size VRF and operates like a conventional cache, i.e., it only stores the most recently accessed vector registers. Essential registers remain readily accessible within the compact VRF, while the others are offloaded to the cache/memory sub-system. The compact VRF design is demonstrated to yield substantial area and power savings compared to a full VRF, with no or minimal impact on performance. This effective trade-off renders the inclusion of vector units in low-cost processors feasible and practical.
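The cache-like behavior described above can be illustrated with a small behavioral model. This is a minimal sketch under stated assumptions, not the authors' hardware design: the class name `CompactVRF`, the least-recently-used eviction policy, the always-write-back-on-eviction behavior, and the dictionary standing in for the cache/memory sub-system are all illustrative choices that the abstract does not specify.

```python
from collections import OrderedDict


class CompactVRF:
    """Toy behavioral model of a cache-like compact vector register file.

    Assumptions (not taken from the paper): LRU eviction, write-back of
    every evicted register, and a dict standing in for the memory hierarchy.
    """

    def __init__(self, num_physical=8, num_architectural=32, vlen_bits=256):
        self.num_physical = num_physical
        self.num_architectural = num_architectural
        self.zero = bytes(vlen_bits // 8)  # value of a never-written register
        self.resident = OrderedDict()      # arch reg -> value, ordered LRU to MRU
        self.memory = {}                   # dispersed (non-resident) registers
        self.hits = 0
        self.misses = 0
        self.spills = 0

    def _touch(self, reg):
        """Make architectural register `reg` resident and most recently used."""
        if not 0 <= reg < self.num_architectural:
            raise IndexError(f"no architectural vector register v{reg}")
        if reg in self.resident:
            self.hits += 1
            self.resident.move_to_end(reg)
            return
        self.misses += 1
        if len(self.resident) >= self.num_physical:
            # Evict the least recently used register to the backing store.
            victim, value = self.resident.popitem(last=False)
            self.memory[victim] = value
            self.spills += 1
        # Fill from the backing store, or with zeros if never written.
        self.resident[reg] = self.memory.pop(reg, self.zero)

    def read(self, reg):
        self._touch(reg)
        return self.resident[reg]

    def write(self, reg, value):
        self._touch(reg)
        self.resident[reg] = value


def run_kernel(vrf, regs, iterations):
    """Repeatedly read a fixed working set of registers, as a loop body might."""
    for _ in range(iterations):
        for reg in regs:
            vrf.read(reg)
    return vrf.misses, vrf.spills
```

In this model, a loop whose working set fits in the eight physical registers misses only on the first touch of each register and never spills, which mirrors the insight that kernels with few active vector registers lose little from a smaller VRF. A working set larger than the physical capacity, accessed cyclically, misses on every access under LRU.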
