FPGA-Accelerated RISC-V ISA Extensions for Efficient Neural Network Inference on Edge Devices

Arya Parameshwara; Santosh Hanamappa Mokashi

TL;DR

This work addresses high-performance yet programmable CNN inference on edge devices by co-designing FPGA neural network accelerators with custom RISC-V ISA extensions. It introduces four extensions (FPGA.VCONV, FPGA.GEMM, FPGA.RELU, and FPGA.CUSTOM) and demonstrates a complete, timing-closed system on the PYNQ-Z2 that achieves a $2.14\times$ average latency speedup and a $49.1\%$ energy reduction relative to an optimized ARM Cortex-A9 software baseline, with the design clocked at 50 MHz. Convolution and matrix multiplication are accelerated by 4×4 and 8×8 systolic arrays, respectively, with memory access and DMA overlap taken into account. The base core uses only $0.43\%$ of LUTs and $11.4\%$ of BRAM, and DSP utilization reaches $38.8\%$ when the accelerators are active. The work also provides a reproducible, open-source framework for ISA-guided FPGA acceleration. Overall, the paper demonstrates a practical, programmable edge-AI platform that bridges software flexibility and accelerator efficiency, suited to energy-conscious edge deployments where fixed-function ASICs give up post-deployment adaptability.

Abstract

Edge AI deployment faces critical challenges in balancing computational performance, energy efficiency, and resource constraints. This paper presents FPGA-accelerated RISC-V instruction set architecture (ISA) extensions for efficient neural network inference on resource-constrained edge devices. We introduce a custom RISC-V core with four novel ISA extensions (FPGA.VCONV, FPGA.GEMM, FPGA.RELU, FPGA.CUSTOM) and integrated neural network accelerators, implemented and validated on the Xilinx PYNQ-Z2 platform. The complete system achieves a $2.14\times$ average latency speedup and a $49.1\%$ energy reduction versus an ARM Cortex-A9 software baseline across four benchmark models (MobileNet V2, ResNet-18, EfficientNet Lite, YOLO Tiny). The hardware implementation closes timing at 50 MHz with a worst negative slack (WNS) of +12.793 ns, that is, positive slack on the worst path, while using $0.43\%$ of LUTs and $11.4\%$ of BRAM for the base core and $38.8\%$ of DSPs when the accelerators are active. Hardware verification confirms successful FPGA deployment, including a working 64 KB BRAM memory interface and AXI interconnect. All performance metrics are obtained from physical hardware measurements. This work establishes a reproducible framework for ISA-guided FPGA acceleration that complements fixed-function ASICs by trading peak performance for programmability.
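The headline figures can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only and is not taken from the paper's code: the function names are hypothetical, the clock estimate assumes the reported slack applies to the worst path under a single 50 MHz constraint, and the power ratio assumes energy equals average power times latency and that the averaged speedup and energy figures can be combined directly, which is only approximately true across four different models.

```python
def critical_path_ns(clock_mhz: float, slack_ns: float) -> float:
    """Worst-path delay implied by a clock constraint and its reported slack."""
    period_ns = 1000.0 / clock_mhz  # 50 MHz -> 20 ns period
    return period_ns - slack_ns


def max_clock_mhz(clock_mhz: float, slack_ns: float) -> float:
    """Rough upper bound on clock frequency if the worst path alone limited it.

    Assumption: re-running place and route at a tighter constraint would not
    change the worst path. Real designs usually behave differently.
    """
    return 1000.0 / critical_path_ns(clock_mhz, slack_ns)


def latency_reduction(speedup: float) -> float:
    """Fractional latency reduction corresponding to a speedup factor."""
    return 1.0 - 1.0 / speedup


def implied_power_ratio(speedup: float, energy_reduction: float) -> float:
    """Average power relative to the baseline, assuming E = P * t.

    E_new / E_old = 1 - energy_reduction and t_new / t_old = 1 / speedup,
    so P_new / P_old = (1 - energy_reduction) * speedup.
    """
    return (1.0 - energy_reduction) * speedup


if __name__ == "__main__":
    # Values quoted in the abstract.
    print(f"critical path: {critical_path_ns(50.0, 12.793):.3f} ns")
    print(f"clock bound:   {max_clock_mhz(50.0, 12.793):.1f} MHz")
    print(f"latency cut:   {latency_reduction(2.14):.1%}")
    print(f"power ratio:   {implied_power_ratio(2.14, 0.491):.3f}")
```

Under these assumptions, a $2.14\times$ speedup corresponds to roughly a 53% cut in latency, and the 12.793 ns of slack against a 20 ns period suggests considerable timing headroom at 50 MHz. The implied power ratio slightly above 1 would mean the energy saving comes mainly from finishing sooner rather than from drawing less power, though this reading depends on how the paper averages its per-model results.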
