HERMES: High-Performance RISC-V Memory Hierarchy for ML Workloads

Pranav Suryadevara

TL;DR

This work addresses memory bottlenecks in ML workloads running on RISC-V by proposing HERMES, a unified memory hierarchy that combines a shared L3 cache, hybrid DRAM+HBM memory, tensor-aware caching, and advanced prefetching to serve accelerators such as Gemmini. The design targets lower latency, higher bandwidth, better cache locality, and lower energy, with the goal of scalable, low-latency ML computation on open RISC-V platforms. Simulation results indicate latency reductions of up to 33%, bandwidth gains of up to 68%, a relative cache-hit-rate improvement of up to 50% (from 60% to 90%, that is, 30 percentage points), and reductions in energy per operation of up to 30% versus a baseline. Overall, HERMES demonstrates the viability and benefits of a cohesive memory-subsystem design for open, accelerator-rich ML workloads on RISC-V, and points toward scalable, efficient ML inference and training.
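The hit-rate figure above (60% to 90%) can be read either as a relative gain or as a difference in percentage points, and the two are easy to confuse. The sketch below works through that arithmetic and, as an illustration only, feeds the two hit rates into the textbook average-memory-access-time formula. The function names and the latency constants `HIT_TIME` and `MISS_PENALTY` are hypothetical and are not taken from the paper, so the resulting cycle counts say nothing about HERMES's measured latency.

```python
def relative_gain(before: float, after: float) -> float:
    """Relative change of `after` with respect to `before`."""
    return (after - before) / before


def percentage_points(before: float, after: float) -> float:
    """Absolute change between two rates, expressed in percentage points."""
    return (after - before) * 100.0


def amat(hit_time: float, hit_rate: float, miss_penalty: float) -> float:
    """Textbook average memory access time: hit_time + miss_rate * miss_penalty."""
    return hit_time + (1.0 - hit_rate) * miss_penalty


# Hit rates quoted in the TL;DR.
BASELINE_HIT, HERMES_HIT = 0.60, 0.90

# Hypothetical latencies in cycles, chosen only for illustration.
HIT_TIME, MISS_PENALTY = 10.0, 100.0

if __name__ == "__main__":
    print(f"relative gain:     {relative_gain(BASELINE_HIT, HERMES_HIT):.0%}")
    print(f"percentage points: {percentage_points(BASELINE_HIT, HERMES_HIT):.0f}")
    print(f"AMAT at baseline hit rate: {amat(HIT_TIME, BASELINE_HIT, MISS_PENALTY):.1f} cycles")
    print(f"AMAT at improved hit rate: {amat(HIT_TIME, HERMES_HIT, MISS_PENALTY):.1f} cycles")
```

With a single-level model like this, the latency change depends entirely on the assumed hit time and miss penalty, which is why the paper's own latency numbers should be taken from its simulation results rather than from this formula.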

Abstract

The growth of machine learning (ML) workloads has underscored the importance of efficient memory hierarchies to address bandwidth, latency, and scalability challenges. HERMES focuses on optimizing memory subsystems for RISC-V architectures to meet the computational needs of ML models such as CNNs, RNNs, and Transformers. This project explores state-of-the-art techniques such as advanced prefetching, tensor-aware caching, and hybrid memory models. The cornerstone of HERMES is the integration of shared L3 caches with fine-grained coherence protocols and specialized pathways to deep-learning accelerators such as Gemmini. The simulation tools gem5 and DRAMSim2 were used to evaluate baseline performance and scalability under representative ML workloads. The findings of this study highlight the design choices and the anticipated challenges, paving the way for low-latency, scalable memory operations for ML applications.

Paper Structure

This paper contains 8 sections, 5 figures, and 3 tables.

Figures (5)

  • Figure 1: HERMES Memory Hierarchy Architecture Diagram
  • Figure 2: Memory Latency Comparison: HERMES configurations reduce latency compared to the baseline RISC-V system. The shared L3 cache, advanced prefetching, and tensor-aware caching contribute to this improvement.
  • Figure 3: Bandwidth Utilization Comparison: The hybrid memory model in HERMES increases bandwidth significantly, supporting the high data-transfer needs of ML workloads.
  • Figure 4: Cache Hit Rate Comparison: Tensor-aware caching in HERMES improves cache hit rates by reducing data evictions and optimizing data reuse patterns.
  • Figure 5: Energy Consumption Comparison: HERMES reduces energy consumption by minimizing off-chip memory accesses and employing efficient caching strategies.