Bare-Metal RISC-V + NVDLA SoC for Efficient Deep Learning Inference
Vineet Kumar, Ajay Kumar M, Yike Li, Shreejith Shanker, Deepu John
TL;DR
The paper presents an open-source SoC that tightly couples the NVDLA accelerator with a 32-bit uRISC_V core to enable bare-metal deep learning inference on FPGA platforms. It introduces an end-to-end flow that generates NVDLA configuration files and RISC-V assembly offline from neural network models, eliminating Linux kernel overhead. The design uses a memory-mapped bus, an arbiter, and an NVDLA wrapper with data-width conversion so that the RISC-V core can control the NVDLA registers directly. It is demonstrated on a ZCU102 board with 512 MB DRAM using LeNet-5, ResNet-18, and ResNet-50. The evaluation reports inference latencies of 4.8 ms for LeNet-5 and 16.2 ms for ResNet-18 at 100 MHz, and notes that the nv_full configuration is faster but constrained in practice by FPGA resources. Future work aims to extend model support and enable wider deployment through additional compilers. Overall, the work advances edge AI by delivering a Linux-free, tightly integrated NVDLA–RISC-V solution with an automated, model-agnostic configuration workflow suitable for resource-constrained devices.
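To make the offline flow more concrete, the following minimal sketch shows how a list of register writes could be turned into bare-metal RISC-V assembly. It is an illustration under stated assumptions, not the authors' toolflow: the base address, the register offsets, the values, and the function name `emit_register_writes` are all hypothetical, and the real generator may emit different instruction sequences.

```python
from typing import Iterable, List, Tuple

# Hypothetical base address of the memory-mapped accelerator registers.
# This value is an assumption for illustration; it is not from the paper.
NVDLA_BASE = 0x4000_0000


def emit_register_writes(
    writes: Iterable[Tuple[int, int]], base: int = NVDLA_BASE
) -> List[str]:
    """Turn (offset, value) pairs into RV32 assembly lines.

    Each write becomes three lines: load the absolute register address
    into t0, load the 32-bit value into t1, then store the word.
    `li` is a standard assembler pseudo-instruction, so the assembler
    chooses the underlying lui/addi sequence.
    """
    lines: List[str] = []
    for offset, value in writes:
        if offset < 0 or offset % 4 != 0:
            raise ValueError(f"offset {offset:#x} must be a non-negative multiple of 4")
        if not 0 <= value <= 0xFFFF_FFFF:
            raise ValueError(f"value {value:#x} does not fit in 32 bits")
        addr = base + offset
        if addr > 0xFFFF_FFFF:
            raise ValueError(f"address {addr:#x} exceeds the 32-bit address space")
        lines.append(f"    li   t0, 0x{addr:08x}")
        lines.append(f"    li   t1, 0x{value:08x}")
        lines.append("    sw   t1, 0(t0)")
    return lines


if __name__ == "__main__":
    # Hypothetical (offset, value) pairs standing in for one layer's configuration.
    example_writes = [(0x0004, 0x0000_0001), (0x5008, 0x0010_0020)]
    print("\n".join(emit_register_writes(example_writes)))
```

Because every store is a plain word write to a fixed address, a program built this way needs no driver or operating system, which is the property the bare-metal flow relies on.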
Abstract
This paper presents a novel System-on-Chip (SoC) architecture for accelerating complex deep learning models for edge computing applications through a combination of hardware and software optimisations. The hardware architecture tightly couples the open-source NVIDIA Deep Learning Accelerator (NVDLA) to a 32-bit, 4-stage pipelined RISC-V core from Codasip called uRISC_V. To offload the model acceleration in software, our toolflow generates bare-metal application code (in assembly), avoiding the operating-system overheads of previous works that have explored similar architectures. This tightly coupled architecture and bare-metal flow lead to improvements in execution speed and storage efficiency, making the design suitable for edge computing solutions. We evaluate the architecture on AMD's ZCU102 FPGA board using the NVDLA-small configuration and test the flow using LeNet-5, ResNet-18 and ResNet-50 models. Our results show that these models can perform inference in 4.8 ms, 16.2 ms and 1.1 s respectively, at a system clock frequency of 100 MHz.
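As a quick sanity check on these figures, the reported latencies can be converted into approximate clock-cycle counts at the stated 100 MHz system clock. The short sketch below does only that arithmetic; the helper name `cycles` is hypothetical, and because the latencies are rounded in the abstract, the cycle counts are approximate.

```python
# Reported inference latencies from the abstract, in microseconds.
REPORTED_LATENCY_US = {
    "LeNet-5": 4_800,        # 4.8 ms
    "ResNet-18": 16_200,     # 16.2 ms
    "ResNet-50": 1_100_000,  # 1.1 s
}

SYSTEM_CLOCK_MHZ = 100  # system clock frequency stated in the abstract


def cycles(latency_us: int, clock_mhz: int) -> int:
    """Cycles = time in microseconds x cycles per microsecond (MHz)."""
    return latency_us * clock_mhz


if __name__ == "__main__":
    for model, latency_us in REPORTED_LATENCY_US.items():
        print(f"{model}: about {cycles(latency_us, SYSTEM_CLOCK_MHZ):,} cycles")
```

Integer microseconds are used instead of floating-point seconds so that the products are exact.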
