A Precision-Scalable RISC-V DNN Processor with On-Device Learning Capability at the Extreme Edge

Longwei Huang, Chao Fang, Qiong Li, Jun Lin, Zhongfeng Wang

TL;DR

This work proposes a precision-scalable RISC-V DNN processor with on-device learning capability. It supports fixed-point DNN inference at precision levels from 2-bit to 16-bit and enhances on-device learning through improved support for FP16 operations.

Abstract

Extreme edge platforms, such as in-vehicle smart devices, require efficient deployment of quantized deep neural networks (DNNs) to enable intelligent applications with limited energy, memory, and computing resources. However, many edge devices struggle to boost the inference throughput of quantized DNNs because quantization levels vary, and these devices lack floating-point (FP) support for on-device learning, which prevents them from improving model accuracy while ensuring data privacy. To tackle these challenges, we propose a precision-scalable RISC-V DNN processor with on-device learning capability. It supports fixed-point DNN inference at precision levels from 2-bit to 16-bit and enhances on-device learning through improved support for FP16 operations. Moreover, we employ multiple methods such as FP16 multiplier reuse and multi-precision integer multiplier reuse, along with balanced mapping of FPGA resources, to significantly improve hardware resource utilization. Experimental results on the Xilinx ZCU102 FPGA show that our processor improves inference throughput by 1.6$\sim$14.6$\times$ and energy efficiency by 1.1$\sim$14.6$\times$ across various DNNs, compared to the prior art, XpulpNN. Additionally, our processor achieves 16.5$\times$ higher FP throughput for on-device learning.
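To make the idea of multi-precision multiplier reuse concrete, the following is a minimal software sketch of the general principle: a wide multiplication can be assembled from narrower partial products, so the same narrow multipliers can also serve low-precision modes. It is not the paper's circuit. The function name `mul_from_halves`, the unsigned-only operands, and the two-way split are assumptions made for illustration.

```python
def mul_from_halves(a: int, b: int, width: int = 8) -> int:
    """Multiply two unsigned `width`-bit integers using only half-width products.

    Hypothetical helper for illustration; `width` is assumed to be even.
    """
    half = width // 2
    mask = (1 << half) - 1

    # Split each operand into low and high half-width pieces.
    a_lo, a_hi = a & mask, a >> half
    b_lo, b_hi = b & mask, b >> half

    # Four half-width partial products. In hardware, these are the small
    # multipliers that could instead compute independent low-precision products.
    ll = a_lo * b_lo
    lh = a_lo * b_hi
    hl = a_hi * b_lo
    hh = a_hi * b_hi

    # Shift each partial product to its weight and accumulate.
    return ll + ((lh + hl) << half) + (hh << (2 * half))


if __name__ == "__main__":
    # An 8-bit product built from 4-bit products matches the direct product.
    print(mul_from_halves(200, 123) == 200 * 123)
```

Applying the same split recursively (16-bit from 8-bit, 8-bit from 4-bit, 4-bit from 2-bit) gives one way to cover the 2-bit to 16-bit range named in the abstract, though the actual hardware may organize its multiplier trees differently.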

Paper Structure

This paper contains 13 sections, 8 figures, and 1 table.

Figures (8)

  • Figure 1: Comparison between the architecture of (a) XpulpNN [garofalo2021xpulpnn] and (b) our proposed DNN processor.
  • Figure 2: The instruction and computation flow of our processor and XpulpNN to perform an INT8 matrix multiplication operator. (a) shows the 4$\times$4 matmul operator; (b) and (c) show the computational and instruction flows of our processor's SA and XpulpNN's dotp units, respectively.
  • Figure 3: Data arrangement method for different precision levels.
  • Figure 4: Architecture of the precision-scalable multiplier with a highly reused 16-bit mantissa multiplier and 8-bit multiplier trees. Only half of the 4-bit multiplier trees and 2-bit multipliers of one 8-bit multiplier tree are reused to ensure the output bit-width remains the same at different precision levels.
  • Figure 5: Architecture of the precision-scalable adder.
  • ...and 3 more figures