SPEED: A Scalable RISC-V Vector Processor Enabling Efficient Multi-Precision DNN Inference

Chuanning Wang, Chao Fang, Xiao Wu, Zhongfeng Wang, Jun Lin

TL;DR

This work tackles the challenge of deploying quantized multi-precision DNNs (MP-DNNs) on edge devices by designing SPEED, a scalable RVV-based processor that supports MP-DNN inference at 4-bit to 16-bit precision. It introduces customized RVV instructions (VSACFG, VSALD, VSAM, VSAC) and a parameterized multi-precision tensor unit (MPTU) to expand parallelism and data reuse, coupled with a flexible mixed dataflow that tailors scheduling to CONV and MM workloads. Synthesized in TSMC 28 nm technology, SPEED achieves a peak throughput of 737.9 GOPS and an energy efficiency of 1383.4 GOPS/W at 4-bit precision, along with substantial area-efficiency gains over prior RVV processors. The combination of customized instructions, scalable hardware, and operator-aware dataflow enables efficient MP-DNN inference for edge AI, with applicability to both CNN and Transformer workloads.

Abstract

Deploying deep neural networks (DNNs) on resource-constrained edge platforms is hindered by their substantial computation and storage demands. Quantized multi-precision DNNs (MP-DNNs) offer a promising solution to these limitations but pose challenges for existing RISC-V processors due to complex instructions, suboptimal parallel processing, and inefficient dataflow mapping. To tackle these challenges, SPEED, a scalable RISC-V vector (RVV) processor, is proposed to enable efficient MP-DNN inference, incorporating innovations in customized instructions, hardware architecture, and dataflow mapping. First, dedicated customized RISC-V instructions are introduced based on RVV extensions to reduce instruction complexity, allowing SPEED to support processing precision ranging from 4-bit to 16-bit with minimized hardware overhead. Second, a parameterized multi-precision tensor unit is developed and integrated within the scalable module to enhance parallel processing capability by providing reconfigurable parallelism that matches the computation patterns of diverse MP-DNNs. Finally, a flexible mixed dataflow method is adopted to improve computational and energy efficiency according to the computing patterns of different DNN operators. SPEED is synthesized in TSMC 28nm technology. Experimental results show that SPEED achieves a peak throughput of 737.9 GOPS and an energy efficiency of 1383.4 GOPS/W for 4-bit operators. Furthermore, SPEED exhibits superior area efficiency compared to prior RVV processors, with enhancements of 5.9$\sim$26.9$\times$ and 8.2$\sim$18.5$\times$ for 8-bit operators and best integer performance, respectively, which highlights SPEED's significant potential for efficient MP-DNN inference.
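
To put the two headline 4-bit figures in perspective, the short Python sketch below derives the energy per operation and the power they would imply together. It is a back-of-the-envelope calculation, not a result reported by the authors: the function names are illustrative, and the power estimate assumes that the peak throughput and the energy efficiency were measured at the same operating point, which the abstract does not state.

```python
# Figures quoted in the abstract for 4-bit operators.
PEAK_THROUGHPUT_GOPS = 737.9
ENERGY_EFFICIENCY_GOPS_PER_W = 1383.4


def implied_power_watts(throughput_gops: float, efficiency_gops_per_w: float) -> float:
    """Power implied by throughput / efficiency.

    ASSUMPTION: both figures refer to the same operating point.
    """
    return throughput_gops / efficiency_gops_per_w


def energy_per_op_picojoules(efficiency_gops_per_w: float) -> float:
    """Energy per operation in pJ.

    1 GOPS/W is 1e9 operations per joule, so one operation costs
    1 / (efficiency * 1e9) J, which is 1e3 / efficiency pJ.
    """
    return 1e3 / efficiency_gops_per_w


power_w = implied_power_watts(PEAK_THROUGHPUT_GOPS, ENERGY_EFFICIENCY_GOPS_PER_W)
energy_pj = energy_per_op_picojoules(ENERGY_EFFICIENCY_GOPS_PER_W)
print(f"implied power: {power_w * 1e3:.1f} mW, energy per op: {energy_pj:.3f} pJ")
```

Under that assumption the figures correspond to roughly 0.53 W and about 0.72 pJ per 4-bit operation. The energy-per-operation value follows from the efficiency figure alone and does not depend on the operating-point assumption.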

Paper Structure (19 sections, 2 equations, 14 figures, 3 tables)

This paper contains 19 sections, 2 equations, 14 figures, 3 tables.

Figures (14)

  • Figure 1: Customized vector instructions of SPEED drive enhanced computational efficiency in operators for MP-DNN inference.
  • Figure 2: Comparison of SPEED and Ara when performing an INT16 MM operator. SPEED uses fewer instructions, requires fewer vector registers, and takes fewer computing cycles than Ara.
  • Figure 3: Micro-architecture of SPEED, the proposed RVV processor.
  • Figure 4: Hierarchical architecture of the multi-precision tensor core and the data parallelism exploited under 16-bit, 8-bit, and 4-bit data precision, respectively.
  • Figure 5: Pipeline stage allocation and operation-precision switching method in the proposed design.
  • ...and 9 more figures