RISC-V R-Extension: Advancing Efficiency with Rented-Pipeline for Edge DNN Processing

Won Hyeok Kim, Hyeong Jin Kim, Tae Hee Han

TL;DR

Edge devices need energy-efficient DNN inference within tight power and area budgets, which makes traditional NPUs impractical for small form factors. The paper proposes the RISC-V R-extension, which combines rented-pipeline execution and Architectural Pipeline Registers (APR) with two new instructions, rfmac.s and rfsmac.s, to accelerate multiply-accumulate (MAC) operations on CPU cores. Across LeNet, ResNet-20, and MobileNet-V1, RV64R improves IPC by up to 29% over RV64F and reduces memory accesses by up to 34%; runtime improves by around 50% versus RV64F and about 32% versus Baseline, with only modest hardware overhead in FPGA implementations. These results point to a viable, low-overhead CPU-based path for edge AI that can scale with future vector extensions, enabling more responsive and power-efficient edge applications.

Abstract

The proliferation of edge devices necessitates efficient computational architectures for lightweight tasks, particularly deep neural network (DNN) inference. Traditional NPUs, though effective for such operations, face challenges in power, cost, and area when integrated into lightweight edge devices. The RISC-V architecture, known for its modularity and open-source nature, offers a viable alternative. This paper introduces the RISC-V R-extension, a novel approach to enhancing DNN processing efficiency on edge devices. The extension features rented-pipeline stages and architectural pipeline registers (APR), which optimize the execution of critical operations, thereby reducing latency and memory access frequency. The extension also includes new custom instructions to support these architectural improvements. Through comprehensive analysis, this study demonstrates the performance boost the R-extension brings to edge-device processing, setting the stage for more responsive and intelligent edge applications.

Paper Structure (12 sections, 6 figures, 4 tables)

This paper contains 12 sections, 6 figures, 4 tables.

Figures (6)

  • Figure 1: (a) RV64F, (b) Baseline, (c) RV64R. Left: convolution code, where H$_{\text{in}}$: input height, W$_{\text{in}}$: input width, M: number of filters, C: number of channels, H$_{\text{fil}}$: filter height, W$_{\text{fil}}$: filter width. Right: assembly after compilation. Highlighted parts are the main instructions in the innermost loop.
  • Figure 2: Forwarding effect on MAC in the R-extension.
  • Figure 3: Instruction format of F-extension (fmul.s), Baseline (fmac.s), and R-extension (rfmac.s, rfsmac.s).
  • Figure 4: MASK and MATCH of F-extension, Baseline, and R-extension.
  • Figure 5: Dataflow of Baseline and R-extension with R_EX and APR.
  • ...and 1 more figure
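
For readers without access to Figure 1, the convolution loop nest its caption describes can be sketched roughly as follows. This is a minimal illustration, not the paper's code: the function name `conv2d_direct`, the loop order, the row-major tensor layout, stride 1, and the absence of padding are all assumptions made here. The innermost statement is the multiply-accumulate (MAC) that, according to the TL;DR, the rfmac.s and rfsmac.s instructions are meant to accelerate.

```c
#include <assert.h>
#include <stddef.h>

/* Direct 2-D convolution sketch (hypothetical helper, not from the paper).
 * Assumed layouts, all row-major:
 *   in  : [c][h_in][w_in]
 *   w   : [m][c][h_fil][w_fil]
 *   out : [m][h_out][w_out], with h_out = h_in - h_fil + 1 (stride 1, no padding)
 */
void conv2d_direct(const float *in, const float *w, float *out,
                   size_t h_in, size_t w_in, size_t m, size_t c,
                   size_t h_fil, size_t w_fil)
{
    /* The filter must fit inside the input. */
    assert(h_fil >= 1 && h_fil <= h_in && w_fil >= 1 && w_fil <= w_in);

    const size_t h_out = h_in - h_fil + 1;
    const size_t w_out = w_in - w_fil + 1;

    for (size_t y = 0; y < h_out; ++y) {
        for (size_t x = 0; x < w_out; ++x) {
            for (size_t f = 0; f < m; ++f) {
                float acc = 0.0f;
                for (size_t ch = 0; ch < c; ++ch) {
                    for (size_t ky = 0; ky < h_fil; ++ky) {
                        for (size_t kx = 0; kx < w_fil; ++kx) {
                            float a = in[(ch * h_in + (y + ky)) * w_in + (x + kx)];
                            float b = w[((f * c + ch) * h_fil + ky) * w_fil + kx];
                            acc += a * b; /* innermost MAC */
                        }
                    }
                }
                out[(f * h_out + y) * w_out + x] = acc;
            }
        }
    }
}
```

Each pass through the innermost loop performs two loads, one multiply, and one add into the same accumulator, so the cost of the whole layer is dominated by how this one statement is compiled.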