
Bit-Width-Aware Design Environment for Few-Shot Learning on Edge AI Hardware

R. Kanda, H. L. Blevec, N. Onizawa, M. Leonardon, V. Gripon, T. Hanyu

TL;DR

This study proposes a methodology for implementing real-time few-shot learning on tiny FPGA SoCs such as the PYNQ-Z1 board. It adopts the FINN framework, which enables implementations with arbitrary fixed-point bit-widths.

Abstract

In this study, we propose a methodology for implementing real-time few-shot learning on tiny FPGA SoCs such as the PYNQ-Z1 board with arbitrary fixed-point bit-widths. Conventional Tensil-based design environments limit hardware implementations to fixed-point bit-widths of 16 or 32 bits. To address this, we adopt the FINN framework, enabling implementations with arbitrary bit-widths. Several customizations and minor adjustments are made, including (1) optimization of Transpose nodes to resolve data format mismatches and (2) added handling that converts the final reduce mean operation to Global Average Pooling (GAP). These adjustments allow us to reduce the bit-width while maintaining the same accuracy as the conventional implementation, and to achieve approximately twice the throughput in evaluations using the CIFAR-10 dataset.
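To make "arbitrary fixed-point bit-widths" concrete, the following is a minimal sketch of rounding a real value to a signed fixed-point format. The function name `quantize_fixed_point`, the split into total and fractional bits, and the round-then-saturate policy are assumptions chosen for illustration; they are not the quantizer used in the paper, which trains quantized models with Brevitas.

```python
def quantize_fixed_point(x, total_bits, frac_bits):
    """Round x to a signed fixed-point value (illustrative helper, not from the paper)."""
    scale = 1 << frac_bits                  # one unit in the last place is 1 / scale
    q_min = -(1 << (total_bits - 1))        # most negative two's-complement code
    q_max = (1 << (total_bits - 1)) - 1     # most positive two's-complement code
    q = round(x * scale)                    # Python's round() sends exact ties to even
    q = max(q_min, min(q_max, q))           # saturate instead of wrapping around
    return q / scale


# The same value represented at several bit-widths (half of the bits fractional).
for bits in (16, 8, 6, 4):
    print(bits, quantize_fixed_point(0.3, total_bits=bits, frac_bits=bits // 2))
```

Narrower formats use coarser steps and a smaller representable range, which is the trade-off against accuracy that a bit-width-aware design flow has to manage.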

Paper Structure (16 sections, 5 figures, 3 tables)

Figures (5)

  • Figure 1: Few-shot learning consists of three main steps: (1) training a backbone network to extract feature vectors from a dataset through backpropagation, (2) using the pretrained backbone to generate feature vectors from a support set and then training a simple classifier based on them, and (3) applying the trained backbone to a query set for inference, where classification is performed using a nearest class mean (NCM) approach.
  • Figure 2: In this study, we replace the previous hardware conversion method using Tensil with FINN. Starting from PyTorch-based pre-training (either floating-point or quantization-aware with Brevitas), models are synthesized through FINN for deployment on the PYNQ-Z1 board. This approach leverages FINN’s efficient dataflow architecture, contrasting with the sequential processing style of Tensil.
  • Figure 3: Flow from model training to hardware: A quantized model is generated using PyTorch with Brevitas, trained to specific bit-widths suitable for FPGA deployment. The trained model is exported as an ONNX file, which is then processed by FINN. Through this process, FINN applies necessary transformations to produce the final bitfile and runtime driver for the FPGA.
  • Figure 4: MatMul outputs in NHWC format, while MultiThreshold expects NCHW, requiring a Transpose operation in between. This mismatch led to improper weight transfer to the MVAU, causing processing errors. The issue was resolved by merging the nodes using AbsorbTransposeIntoMultiThreshold and inserting a Transpose afterward to ensure correct data flow for subsequent layers.
  • Figure 5: Overview of the implementation on the PYNQ-Z1 board. The backbone network is deployed on the FPGA, which extracts feature vectors from input data. These feature vectors are transferred to the CPU, where the NCM (Nearest Class Mean) classifier performs the final classification.
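
The nearest class mean (NCM) step described in Figures 1 and 5 can be sketched in plain Python as follows. This is a minimal illustration, assuming Euclidean distance and feature vectors already extracted by the backbone; the function names `class_means` and `ncm_predict` and the toy data are hypothetical and do not come from the paper.

```python
import math


def class_means(support_features, support_labels):
    """Average the support-set feature vectors of each class (hypothetical helper)."""
    sums, counts = {}, {}
    for feat, label in zip(support_features, support_labels):
        if label not in sums:
            sums[label] = [0.0] * len(feat)
            counts[label] = 0
        sums[label] = [s + f for s, f in zip(sums[label], feat)]
        counts[label] += 1
    return {label: [s / counts[label] for s in total] for label, total in sums.items()}


def ncm_predict(query_feature, means):
    """Return the label whose class mean is closest to the query (Euclidean distance assumed)."""
    return min(means, key=lambda label: math.dist(query_feature, means[label]))


# Toy 2-D "feature vectors" standing in for backbone outputs.
support = [[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [10.0, 12.0]]
labels = ["cat", "cat", "dog", "dog"]
means = class_means(support, labels)
print(ncm_predict([1.0, 1.0], means))
```

Because this classifier only averages and compares vectors, it is cheap enough to run on the CPU side of the SoC while the FPGA handles feature extraction.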