ONNX-to-Hardware Design Flow for Adaptive Neural-Network Inference on FPGAs

Federico Manca, Francesco Ratto, Francesca Palumbo

TL;DR

The paper addresses edge CPS needs for diverse, energy-efficient NN inference on resource-limited FPGAs by proposing an ONNX-to-Hardware design flow that supports quantized CNNs via QONNX and introduces data- and computation-approximation for adaptivity. It combines a streaming CNN accelerator template with dataflow-driven HLS generation and a coarse-reconfigurability pipeline (MDC) to enable runtime profile switching. Quantization-aware training with mixed-precision profiles demonstrates trade-offs between accuracy and power, while runtime merging via MDC yields an adaptive inference engine capable of power reduction with minimal accuracy loss. The approach aims to enable flexible, adaptive edge inference for CPS, with plans to scale to more complex models and datasets under EU MYRTUS.
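The runtime profile switching mentioned above can be pictured with a small selection policy. The following is a minimal sketch, not the paper's implementation: the `Profile` type, the `select_profile` function, and every accuracy and power figure are hypothetical placeholders chosen only for illustration. It assumes the host picks the most accurate profile whose power draw fits the current budget.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Profile:
    """One working point of the adaptive engine (all values hypothetical)."""
    name: str
    accuracy: float   # classification accuracy in [0, 1]
    power_mw: float   # average power draw in milliwatts


def select_profile(profiles: Sequence[Profile],
                   power_budget_mw: float) -> Optional[Profile]:
    """Return the most accurate profile that fits the power budget.

    Returns None when no profile is cheap enough.
    """
    feasible = [p for p in profiles if p.power_mw <= power_budget_mw]
    if not feasible:
        return None
    return max(feasible, key=lambda p: p.accuracy)


# Placeholder numbers, not results from the paper.
PROFILES = [
    Profile("high_accuracy", accuracy=0.98, power_mw=150.0),
    Profile("low_power", accuracy=0.95, power_mw=100.0),
]

if __name__ == "__main__":
    for budget in (200.0, 120.0, 50.0):
        chosen = select_profile(PROFILES, budget)
        print(budget, chosen.name if chosen else "no feasible profile")
```

In a real system the budget would come from a monitor such as a battery gauge, and the chosen profile would be written to the accelerator's configuration interface; both are outside the scope of this sketch.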

Abstract

The challenges involved in executing neural networks (NNs) at the edge include providing diversity, flexibility, and sustainability. That implies, for instance, supporting evolving applications and algorithms energy-efficiently. Using hardware or software accelerators can deliver fast and efficient computation of the NNs, while flexibility can be exploited to support long-term adaptivity. Nonetheless, handcrafting an NN for a specific device, although it can lead to an optimal solution, takes time and experience, which is why frameworks for hardware accelerators are being developed. This work, starting from a preliminary semi-integrated ONNX-to-hardware toolchain [21], focuses on enabling approximate computing by leveraging the distinctive ability of the original toolchain to favor adaptivity. The goal is to allow lightweight, adaptable NN inference on FPGAs at the edge.

Paper Structure (15 sections, 4 figures, 1 table)

This paper contains 15 sections, 4 figures, 1 table.

Figures (4)

  • Figure 1: Representation of a simple CNN and its mapping to a streaming architecture.
  • Figure 2: On the left, the ONNX-to-Hardware design flow for the generation of adaptive neural-network inference engines on FPGAs. The training library could be any library able to export to QONNX. On the right, the streaming-based template architecture for a convolutional layer.
  • Figure 3: Accuracy vs. power chart of the obtained profiles. The Mixed design is shown in green. The yellow arrows point to the two configurations selected for adaptivity.
  • Figure 4: On top, the resource utilization of the adaptive engine and its performance metrics under different profiles. On the left, the architecture of a complete adaptable system that exploits the proposed adaptive inference engine. On the right, a comparison of the resulting battery duration (assuming a 10 Ah energy budget) and the number of classifications executable by the adaptive engine and by a non-adaptive one supporting only the higher-accuracy profile.
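The battery comparison in the Figure 4 caption rests on simple arithmetic, sketched below under stated assumptions. Because ampere-hours measure charge rather than energy, a supply voltage is needed to convert the budget into energy; that voltage, the power values, and the latency used here are hypothetical and are not taken from the paper. The sketch also assumes the engine runs continuously at constant power.

```python
def battery_hours(charge_ah: float, supply_v: float, power_w: float) -> float:
    """Hours of continuous operation: energy (Wh) divided by power (W)."""
    return charge_ah * supply_v / power_w


def classifications_on_budget(charge_ah: float, supply_v: float,
                              power_w: float, latency_s: float) -> int:
    """Number of whole classifications that fit in the energy budget."""
    energy_j = charge_ah * supply_v * 3600.0          # Wh -> joules
    energy_per_classification_j = power_w * latency_s  # joules per inference
    return int(energy_j // energy_per_classification_j)


if __name__ == "__main__":
    # Hypothetical operating points: 10 Ah budget, assumed 5 V supply.
    for name, power_w in (("high_accuracy", 2.0), ("low_power", 1.0)):
        hours = battery_hours(10.0, 5.0, power_w)
        count = classifications_on_budget(10.0, 5.0, power_w, latency_s=0.5)
        print(name, hours, count)
```

An adaptive engine that spends part of its time in the lower-power profile would land between the two operating points; weighting the two power values by the time spent in each profile gives a first-order estimate.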