HARFLOW3D: A Latency-Oriented 3D-CNN Accelerator Toolflow for HAR on FPGA Devices

Petros Toupas; Alexander Montgomerie-Corcoran; Christos-Savvas Bouganis; Dimitrios Tzovaras

TL;DR

HARFLOW3D addresses latency-optimized deployment of 3D CNNs for Human Action Recognition (HAR) on FPGAs with an automated, streaming-architecture toolflow that maps ONNX models to FPGA designs using an SDFG-based representation and latency-driven design-space exploration. The toolflow combines a neural-network parser, parameterizable hardware building blocks with runtime configurability, performance and resource models, a scheduling algorithm, and a simulated-annealing-based optimizer, and its designs lie on the Pareto front of latency versus accuracy across diverse models and devices. Key contributions include the ONNX-to-SDFG parser, the set of parameterizable hardware blocks, and validation across multiple 3D HAR models and FPGA platforms, showing latency competitive with hand-tuned baselines and up to 5$\times$ better performance than some existing works. Overall, HARFLOW3D demonstrates the potential for practical, low-latency HAR on FPGAs, including for 3D CNN models that had not been mapped to FPGAs before, and paves the way for extensions to other 3D domains.

Abstract

For Human Action Recognition (HAR) tasks, 3D Convolutional Neural Networks have proven to be highly effective, achieving state-of-the-art results. This study introduces a novel streaming-architecture-based toolflow for mapping such models onto FPGAs, taking into account both the model's inherent characteristics and the features of the targeted FPGA device. The HARFLOW3D toolflow takes as input a 3D CNN in ONNX format and a description of the FPGA characteristics, and generates a design that minimizes the latency of the computation. The toolflow comprises a number of parts: i) a 3D CNN parser, ii) a performance and resource model, iii) a scheduling algorithm for executing 3D models on the generated hardware, iv) a resource-aware optimization engine tailored for 3D models, and v) an automated mapping to synthesizable code for FPGAs. The ability of the toolflow to support a broad range of models and devices is shown through a number of experiments on various 3D CNN and FPGA system pairs. Furthermore, the toolflow has produced high-performing results for 3D CNN models that have not been mapped to FPGAs before, demonstrating the potential of FPGA-based systems in this space. Overall, HARFLOW3D delivers competitive latency compared to a range of state-of-the-art hand-tuned approaches, achieving up to 5$\times$ better performance than some of the existing works.
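The resource-aware, latency-minimizing optimization described in the abstract can be illustrated with a toy simulated-annealing loop (the figure list confirms that simulated annealing is used, but not how). The sketch below is not the toolflow's implementation: the workloads, the resource budget, the per-layer parallelism factors, and the latency and resource formulas are all hypothetical stand-ins chosen only to show the accept/reject mechanics under a resource constraint.

```python
import math
import random

# Hypothetical toy problem (not from the paper): each layer has a workload
# and a parallelism factor chosen by the optimizer.
WORKLOADS = [1200, 800, 400]   # made-up per-layer operation counts
RESOURCE_BUDGET = 24           # made-up budget of parallel compute units


def latency(factors):
    """Toy latency model: cycles per layer shrink with its parallelism."""
    return sum(w / f for w, f in zip(WORKLOADS, factors))


def resources(factors):
    """Toy resource model: one compute unit per parallel lane."""
    return sum(factors)


def anneal(steps=5000, t_start=100.0, t_min=1e-3, seed=0):
    """Minimize toy latency under the resource budget."""
    rng = random.Random(seed)
    current = [1] * len(WORKLOADS)   # minimal-resource starting design
    best = list(current)
    cooling = (t_min / t_start) ** (1.0 / steps)  # geometric cooling rate
    temp = t_start
    for _ in range(steps):
        # Propose a neighbour: change one layer's factor by +/- 1.
        candidate = list(current)
        i = rng.randrange(len(candidate))
        candidate[i] = max(1, candidate[i] + rng.choice((-1, 1)))
        # Discard candidates that do not fit the resource budget.
        if resources(candidate) <= RESOURCE_BUDGET:
            delta = latency(candidate) - latency(current)
            # Metropolis criterion: always accept improvements, accept
            # regressions with probability exp(-delta / temp).
            if delta <= 0 or rng.random() < math.exp(-delta / temp):
                current = candidate
                if latency(current) < latency(best):
                    best = list(current)
        temp *= cooling
    return best, latency(best)


if __name__ == "__main__":
    design, cycles = anneal()
    print(design, round(cycles, 1))
```

In this toy setting the search tends to give more parallelism to the heaviest layer, because that is where each extra unit removes the most cycles. A real toolflow would replace both models with its own performance and resource estimates.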

Paper Structure (24 sections, 20 equations, 8 figures, 6 tables, 2 algorithms)

This paper contains 24 sections, 20 equations, 8 figures, 6 tables, 2 algorithms.

Introduction
Related Work
Proposed Architecture
Neural Network Model Parser
Building Blocks Description
Hardware Design and Implementation
Modelling
Performance Modelling
Resource Modelling
Latency-Driven Design Space Exploration
Scheduling
Optimization Strategy
Transformations
Feature-Map Dimensions Reshaping
Coarse-grain Folding
...and 9 more sections

Figures (8)

Figure 1: Pareto front on 3D CNNs: Latency over Accuracy. Designs produced by the proposed HARFLOW3D toolflow dominate the Pareto front.
Figure 2: Block diagram of an accelerator instance produced by the HARFLOW3D toolflow. Black lines are AXI-Stream signals, with arrows indicating the direction of the connection; blue lines are high-throughput AXI interfaces for DMA access; red lines are AXI-Lite connections for runtime configuration of the hardware nodes; and green lines are the DDR IO interfaces for communicating with off-chip memory.
Figure 3: Diagram of hardware for Convolution, and how it can be used with runtime parameters. The blue blocks represent compile-time configurable hardware modules. The red blocks represent runtime configurable hardware modules. Cross-hatching gives an example of how hardware elements can be bypassed at runtime.
Figure 4: Evolution of latency during Simulated Annealing for various FPGA devices.
Figure 5: The dataflow of a simple design consisting of Convolution, ReLU, and FC layers. As the red lines and the crossbar indicate, data can flow from Convolution to ReLU within the FPGA without being sent back to the off-chip memory. This is the result of the Fuse Activation optimization.
...and 3 more figures
