Efficient Edge AI: Deploying Convolutional Neural Networks on FPGA with the Gemmini Accelerator

Federico Nicolas Peccia, Svetlana Pavlitska, Tobias Fleck, Oliver Bringmann

TL;DR

The paper tackles the challenge of deploying convolutional neural networks on edge FPGA platforms by delivering an end-to-end workflow around the Gemmini accelerator. It combines hardware-aware model modifications, quantization, framework conversions, and AutoTVM-based scheduling, with a practical split of work between the programmable logic (PL) and the processing system (PS) of a Xilinx Zynq SoC. The approach yields real-time YOLOv7-tiny inference at an energy efficiency of 36.5 GOP/s/W and improves on baseline Gemmini designs, embedded GPUs, and server GPUs, including an 85%–93% reduction in energy for comparable tasks. The work is validated through a traffic-monitoring case study that integrates the platform into a broader edge-system pipeline, showing its practical value for privacy-preserving, low-power edge AI deployments.

Abstract

Growing concerns about energy consumption and privacy have prompted the development of AI solutions deployable on the edge, circumventing the substantial CO2 emissions associated with cloud servers and mitigating the risks of sharing sensitive data. However, deploying Convolutional Neural Networks (CNNs) on non-off-the-shelf edge devices remains a complex and labor-intensive task. In this paper, we present an end-to-end workflow for deploying CNNs on Field Programmable Gate Arrays (FPGAs) using the Gemmini accelerator, which we modified for efficient implementation on FPGAs. We describe how we leverage open source software at each optimization step of the deployment process, the customizations we added to it, and their impact on the final system's performance. We achieved real-time performance by deploying a YOLOv7 model on a Xilinx ZCU102 FPGA with an energy efficiency of 36.5 GOP/s/W. Our FPGA-based solution demonstrates superior power efficiency compared with other embedded hardware devices, and even outperforms other FPGA reference implementations. Finally, we show how this kind of solution can be integrated into a wider system by testing our proposed platform in a traffic monitoring scenario.

Paper Structure (19 sections, 9 figures, 4 tables)

Figures (9)

  • Figure 1: Proposed hardware solution, showing the improvement in the size of Gemmini's systolic array using the DSP packing technique, and a detailed description of how the DSP packing is implemented on the DSP48E2 available on Xilinx FPGAs.
  • Figure 2: End-to-end deployment workflow
  • Figure 3: Selection of the input image size
  • Figure 4: Selection of the pruned models based on mAP and parameter sparsity
  • Figure 5: AutoTVM convolution total latency per model version for the Gemmini accelerator (other kinds of layers behave similarly).
  • ...and 4 more figures