HERMES: High-Performance RISC-V Memory Hierarchy for ML Workloads
Pranav Suryadevara
TL;DR
This work addresses memory bottlenecks in ML workloads running on RISC-V by proposing HERMES, a unified memory hierarchy that combines a shared L3 cache, hybrid DRAM+HBM memory, tensor-aware caching, and advanced prefetching to serve accelerators such as Gemmini. The design targets lower latency, higher bandwidth, better cache locality, and lower energy, with the goal of scalable, low-latency ML computation on open RISC-V platforms. Relative to a baseline, simulation results indicate latency reductions of up to 33%, bandwidth gains of up to 68%, cache hit rates rising from 60% to as much as 90% (a 30-percentage-point increase, or 50% in relative terms), and reductions in energy per operation of up to 30%. Overall, HERMES demonstrates the viability and benefits of a cohesive memory subsystem design for open, accelerator-rich RISC-V platforms, pointing toward scalable, efficient ML inference and training.
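The hit-rate figure can be read in more than one way, so a short worked calculation may help. The sketch below is illustrative only: the function name `hit_rate_change` and its output keys are hypothetical, and the only inputs taken from the text are the 60% and 90% hit rates. It shows that a rise from 60% to 90% is 30 percentage points, a 50% relative gain, and a fourfold drop in miss rate (from 40% to 10%).

```python
def hit_rate_change(before: float, after: float) -> dict:
    """Summarize a cache hit-rate change; rates are fractions in (0, 1]."""
    if not (0.0 < before <= 1.0 and 0.0 < after <= 1.0):
        raise ValueError("hit rates must be fractions in (0, 1]")
    miss_before = 1.0 - before
    miss_after = 1.0 - after
    return {
        # Difference in percentage points (90% - 60% = 30 points).
        "absolute_points": (after - before) * 100.0,
        # Gain relative to the starting hit rate (30 / 60 = 50%).
        "relative_gain_pct": (after - before) / before * 100.0,
        # How many times fewer accesses miss the cache (40% / 10% = 4x).
        "miss_reduction_factor": (
            miss_before / miss_after if miss_after > 0 else float("inf")
        ),
    }


# Hit rates quoted in the TL;DR: 60% baseline, 90% with HERMES.
summary = hit_rate_change(0.60, 0.90)
for name, value in summary.items():
    print(f"{name}: {value:.1f}")
```

The miss-rate view is often the more informative one for a memory hierarchy, since each miss is what incurs a trip to the next level.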
Abstract
The growth of machine learning (ML) workloads has underscored the importance of efficient memory hierarchies in addressing bandwidth, latency, and scalability challenges. HERMES focuses on optimizing the memory subsystem of RISC-V architectures to meet the computational needs of ML models such as CNNs, RNNs, and Transformers. The project explores state-of-the-art techniques including advanced prefetching, tensor-aware caching, and hybrid memory models. The cornerstone of HERMES is the integration of a shared L3 cache with fine-grained coherence protocols and specialized pathways to deep-learning accelerators such as Gemmini. Simulation tools such as gem5 and DRAMSim2 were used to evaluate baseline performance and scalability under representative ML workloads. The findings highlight the key design choices and the anticipated challenges, paving the way for low-latency, scalable memory operations for ML applications.
