Table of Contents
Fetching ...

OpenGeMM: A High-Utilization GeMM Accelerator Generator with Lightweight RISC-V Control and Tight Memory Coupling

Xiaoling Yi, Ryan Antonio, Joren Dumoulin, Jiacong Sun, Josse Van Delm, Guilherme Paim, Marian Verhelst

TL;DR

This work proposes OpenGeMM, an open-source acceleration platform, jointly demonstrating high efficiency and utilization, as well as ease of configurability and programmability, and demonstrates a 3.58x to 16.40x speedup on normalized throughput across a wide variety of GeMM workloads, while achieving 4.68 TOPS/W system efficiency.

Abstract

Deep neural networks (DNNs) face significant challenges when deployed on resource-constrained extreme edge devices due to their computational and data-intensive nature. While standalone accelerators tailored for specific application scenarios suffer from inflexible control and limited programmability, generic hardware acceleration platforms coupled with RISC-V CPUs can enable high reusability and flexibility, yet typically at the expense of system level efficiency and low utilization. To fill this gap, we propose OpenGeMM, an open-source acceleration platform, jointly demonstrating high efficiency and utilization, as well as ease of configurability and programmability. OpenGeMM encompasses a parameterized Chisel-coded GeMM accelerator, a lightweight RISC-V processor, and a tightly coupled multi-banked scratchpad memory. The GeMM core utilization and system efficiency are boosted through three mechanisms: configuration pre-loading, input pre-fetching with output buffering, and programmable strided memory access. Experimental results show that OpenGeMM can consistently achieve hardware utilization ranging from 81.89% to 99.34% across diverse CNN and Transformer workloads. Compared to the SotA open-source Gemmini accelerator, OpenGeMM demonstrates a 3.58x to 16.40x speedup on normalized throughput across a wide variety ofGeMM workloads, while achieving 4.68 TOPS/W system efficiency.

OpenGeMM: A High-Utilization GeMM Accelerator Generator with Lightweight RISC-V Control and Tight Memory Coupling

TL;DR

This work proposes OpenGeMM, an open-source acceleration platform, jointly demonstrating high efficiency and utilization, as well as ease of configurability and programmability, and demonstrates a 3.58x to 16.40x speedup on normalized throughput across a wide variety of GeMM workloads, while achieving 4.68 TOPS/W system efficiency.

Abstract

Deep neural networks (DNNs) face significant challenges when deployed on resource-constrained extreme edge devices due to their computational and data-intensive nature. While standalone accelerators tailored for specific application scenarios suffer from inflexible control and limited programmability, generic hardware acceleration platforms coupled with RISC-V CPUs can enable high reusability and flexibility, yet typically at the expense of system level efficiency and low utilization. To fill this gap, we propose OpenGeMM, an open-source acceleration platform, jointly demonstrating high efficiency and utilization, as well as ease of configurability and programmability. OpenGeMM encompasses a parameterized Chisel-coded GeMM accelerator, a lightweight RISC-V processor, and a tightly coupled multi-banked scratchpad memory. The GeMM core utilization and system efficiency are boosted through three mechanisms: configuration pre-loading, input pre-fetching with output buffering, and programmable strided memory access. Experimental results show that OpenGeMM can consistently achieve hardware utilization ranging from 81.89% to 99.34% across diverse CNN and Transformer workloads. Compared to the SotA open-source Gemmini accelerator, OpenGeMM demonstrates a 3.58x to 16.40x speedup on normalized throughput across a wide variety ofGeMM workloads, while achieving 4.68 TOPS/W system efficiency.

Paper Structure

This paper contains 17 sections, 1 equation, 7 figures, 3 tables.

Figures (7)

  • Figure 1: OpenGeMM platform overview.
  • Figure 2: Dataflow representation in the GeMM accelerator hardware generator.
  • Figure 3: GeMM accelerator hardware generator microarchitecture. (a) Matrix multiplication that GeMM accelerator processes in one cycle with 3D spatial unrollings. (b) DotProd microarchitecture.
  • Figure 4: (a) Scenario without configuration pre-loading, input pre-fetch, and output buffering. (b) Conceptual visualizations for configuration pre-loading, input pre-fetch, and output buffering. (c) Data layout optimization example.
  • Figure 5: Utilization analysis of OpenGeMM under 500 random computational matrix sizes and different utilization enhancement techniques.
  • ...and 2 more figures