A Scalable RISC-V Vector Processor Enabling Efficient Multi-Precision DNN Inference

Chuanning Wang; Chao Fang; Xiao Wu; Zhongfeng Wang; Jun Lin

TL;DR

This work tackles the challenge of deploying multi-precision quantized DNNs on RISC-V by introducing SPEED, a scalable RVV processor that supports 4-bit to 16-bit precision through customized instructions, a parameterized systolic array unit (SAU), and a mixed dataflow strategy. It integrates a dedicated vector instruction decode path, a flexible dataflow mapping, and a high-parallelism SAU to maximize data reuse and throughput. Synthesis results in a 28 nm process show SPEED achieving a peak throughput of $287.41$ GOPS and an energy efficiency of $1335.79$ GOPS/W at 4-bit precision, with area efficiency improvements of $2.04\times$ and $1.63\times$ over the pioneering open-source vector processor Ara at 16-bit and 8-bit precision, respectively, highlighting SPEED's potential for scalable, energy-efficient multi-precision DNN inference on RVV platforms.

Abstract

RISC-V processors encounter substantial challenges in deploying multi-precision deep neural networks (DNNs) due to their restricted precision support, constrained throughput, and suboptimal dataflow design. To tackle these challenges, a scalable RISC-V vector (RVV) processor, namely SPEED, is proposed to enable efficient multi-precision DNN inference through innovations in customized instructions, hardware architecture, and dataflow mapping. First, dedicated customized RISC-V instructions are proposed based on RVV extensions, providing SPEED with fine-grained control over processing precision ranging from 4 to 16 bits. Second, a parameterized multi-precision systolic array unit is incorporated within the scalable module to enhance parallel processing capability and data reuse opportunities. Finally, a mixed multi-precision dataflow strategy, compatible with different convolution kernels and data precisions, is proposed to improve data utilization and computational efficiency. We synthesize SPEED in TSMC 28nm technology. The experimental results demonstrate that SPEED achieves a peak throughput of 287.41 GOPS and an energy efficiency of 1335.79 GOPS/W at 4-bit precision. Moreover, when compared to the pioneering open-source vector processor Ara, SPEED provides area efficiency improvements of 2.04$\times$ and 1.63$\times$ under 16-bit and 8-bit precision conditions, respectively, which shows SPEED's significant potential for efficient multi-precision DNN inference.
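To make the idea of 4-bit to 16-bit multi-precision processing concrete, the following is a minimal Python sketch of sub-word packing: one 16-bit word carries one 16-bit, two 8-bit, or four 4-bit signed elements, and a dot product runs over all elements of the word. This is a generic illustration only. The function names (`pack`, `unpack`, `packed_dot`), the 16-bit word width, and the element layout are assumptions made for this example and are not taken from SPEED's instructions or datapath.

```python
def pack(values, bits, word_bits=16):
    """Pack signed `bits`-wide integers into one unsigned word, lowest element first.

    Hypothetical helper for illustration; not SPEED's actual data layout.
    """
    per_word = word_bits // bits
    if len(values) != per_word:
        raise ValueError(f"expected {per_word} values for {bits}-bit elements")
    mask = (1 << bits) - 1
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    word = 0
    for i, v in enumerate(values):
        if not lo <= v <= hi:
            raise ValueError(f"{v} does not fit in a signed {bits}-bit element")
        # Two's-complement encode the element and place it in its slot.
        word |= (v & mask) << (i * bits)
    return word


def unpack(word, bits, word_bits=16):
    """Recover the signed elements stored in `word`, lowest element first."""
    mask = (1 << bits) - 1
    sign_bit = 1 << (bits - 1)
    out = []
    for i in range(word_bits // bits):
        raw = (word >> (i * bits)) & mask
        # Sign-extend: a set top bit means the element is negative.
        out.append(raw - (1 << bits) if raw & sign_bit else raw)
    return out


def packed_dot(a_word, b_word, bits, word_bits=16):
    """Dot product over all sub-word elements of two packed words."""
    a = unpack(a_word, bits, word_bits)
    b = unpack(b_word, bits, word_bits)
    return sum(x * y for x, y in zip(a, b))


if __name__ == "__main__":
    a4 = pack([1, -2, 3, -4], bits=4)
    b4 = pack([2, 2, 2, 2], bits=4)
    print(packed_dot(a4, b4, bits=4))  # four 4-bit products from one word pair
```

Under this assumed layout, halving the element width doubles the number of multiply-accumulates available per word, which is the general reason lower precisions can reach higher throughput on a fixed-width datapath.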

Paper Structure (10 sections, 5 figures, 1 table)

This paper contains 10 sections, 5 figures, 1 table.

Introduction
The Proposed SPEED Architecture
Customized Instructions
Hardware Architecture
Dataflow Mapping
Experimental Results
Experimental Setup
Model Evaluation
Analysis of Synthesized Results
Conclusion

Figures (5)

Figure 1: Customized instructions and overall architecture of SPEED.
Figure 2: Examples of how the CF strategy and the FF strategy work with multi-precision elements. Note that 4-bit data is handled in the same way as 16-bit and 8-bit data.
Figure 3: Layer-wise area efficiency breakdown of GoogLeNet on SPEED under 16-bit precision. Our mixed dataflow strategy surpasses the FF-only and CF-only strategies by 1.88$\times$ and 1.38$\times$, respectively.
Figure 4: Average area efficiency under multi-precision DNN benchmarks. SPEED outperforms Ara by 2.77$\times$ and 6.39$\times$ at 16-bit and 8-bit precision, respectively.
Figure 5: Area breakdown of (a) SPEED and (b) a single lane. The SAU occupies only 26% of the area in a single lane while achieving significant computational performance.
