PEFSL: A deployment Pipeline for Embedded Few-Shot Learning on a FPGA SoC

Lucas Grativol Ribeiro, Lubin Gauthier, Mathieu Leonardon, Jérémy Morlier, Antoine Lavrard-Meyer, Guillaume Muller, Virginie Fresse, Matthieu Arzel

TL;DR

The paper addresses the difficulty of deploying few-shot learning on energy- and latency-constrained FPGA SoCs. It delivers an end-to-end open-source pipeline built on the Tensil framework, along with a low-power demonstrator for real-time object classification. It selects compact ResNet backbones (notably ResNet-9) and explores the design space across input resolution, downsampling, and network width to meet embedded constraints on MiniImageNet. The key contributions are the PEFSL pipeline for training, ONNX export, RTL generation, and FPGA deployment; a demonstrator achieving around $30\ \mathrm{ms}$ latency at $6.2\ \mathrm{W}$ on a PYNQ-Z1; and an analysis of the latency-accuracy trade-off. The work targets rapid, open, on-device adaptation for robotics, drones, and autonomous systems, where real-time, energy-efficient few-shot inference is essential.

Abstract

This paper tackles the challenges of implementing few-shot learning on embedded systems, specifically FPGA SoCs. Few-shot learning is a vital approach for adapting to diverse classification tasks, especially when the costs of data acquisition or labeling are prohibitively high. Our contributions encompass the development of an end-to-end open-source pipeline for a few-shot learning platform for object classification on FPGA SoCs. The pipeline is built on top of the Tensil open-source framework and facilitates the design, training, evaluation, and deployment of DNN backbones tailored for few-shot learning. Additionally, we showcase our work's potential by building and deploying a low-power, low-latency demonstrator trained on the MiniImageNet dataset with a dataflow architecture. The proposed system has a latency of 30 ms while consuming 6.2 W on the PYNQ-Z1 board.
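The two figures reported in the abstract (30 ms latency, 6.2 W) can be combined into a rough energy-per-inference estimate. The sketch below is a back-of-the-envelope calculation, not a measurement from the paper: it assumes the 6.2 W draw is constant over the full 30 ms and that inferences run strictly one after another. The function names are hypothetical.

```python
def energy_per_inference_joules(latency_s: float, power_w: float) -> float:
    """Energy for one inference, assuming constant power over the latency."""
    return latency_s * power_w


def max_sequential_rate_hz(latency_s: float) -> float:
    """Upper bound on inferences per second when runs do not overlap."""
    return 1.0 / latency_s


# Values reported in the abstract for the PYNQ-Z1 demonstrator.
LATENCY_S = 30e-3  # 30 ms
POWER_W = 6.2      # 6.2 W

energy_j = energy_per_inference_joules(LATENCY_S, POWER_W)
rate_hz = max_sequential_rate_hz(LATENCY_S)
print(f"{energy_j:.3f} J per inference, {rate_hz:.1f} inferences/s")
```

Under these assumptions the estimate is about 0.186 J per inference and at most roughly 33 inferences per second. Actual figures would depend on what the 6.2 W measurement includes and on any overlap between acquisition and inference.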

Paper Structure

This paper contains 17 sections, 5 figures, and 1 table.

Figures (5)

  • Figure 1: Our few-shot learning method.
  • Figure 2: Structure of a ResNet-9, where initial layers employ 16 output feature maps, and subsequent layers scale their output channels accordingly.
  • Figure 3: Modular pipeline for the deployment of a few-shot learning system on an FPGA SoC.
  • Figure 4: Schematic of the system.
  • Figure 5: Accuracy and latency trade-off. Graphs depict tests on $32\times32$ (top) and $84\times84$ (bottom) images. Different feature-map configurations are denoted by distinct colors, and different training image sizes by distinct marker shapes. Strided and non-strided architectures are distinguished by dark and light colors. The backbone architecture is also varied: ResNet-9 is shown with empty markers and ResNet-12 with filled markers.