HARFLOW3D: A Latency-Oriented 3D-CNN Accelerator Toolflow for HAR on FPGA Devices
Petros Toupas, Alexander Montgomerie-Corcoran, Christos-Savvas Bouganis, Dimitrios Tzovaras
TL;DR
HARFLOW3D addresses latency-optimized deployment of 3D CNNs for HAR on FPGAs with an automated, streaming-architecture toolflow that maps ONNX models to FPGA designs using an SDFG-based representation and latency-aware design-space exploration. The approach combines a neural-network parser, parameterizable hardware building blocks with runtime configurability, performance and resource models, and simulated-annealing-based optimization and scheduling to produce Pareto-optimal latency-accuracy designs across diverse models and devices. Key contributions include the ONNX-to-SDFG parser, the set of parameterizable hardware blocks, and validation across multiple 3D HAR models and FPGA platforms, showing latency competitive with hand-tuned baselines and up to 5$\times$ better performance than some existing works. Overall, HARFLOW3D shows that low-latency 3D-CNN HAR is practical on resource-constrained FPGA hardware, and the approach could be extended to other 3D domains.
Abstract
For Human Action Recognition (HAR) tasks, 3D Convolutional Neural Networks have proven to be highly effective, achieving state-of-the-art results. This study introduces a novel streaming-architecture-based toolflow for mapping such models onto FPGAs, taking into account both the model's inherent characteristics and the features of the targeted FPGA device. The HARFLOW3D toolflow takes as input a 3D CNN in ONNX format and a description of the FPGA characteristics, and generates a design that minimizes the latency of the computation. The toolflow comprises several parts: i) a 3D CNN parser, ii) a performance and resource model, iii) a scheduling algorithm for executing 3D models on the generated hardware, iv) a resource-aware optimization engine tailored for 3D models, and v) an automated mapping to synthesizable code for FPGAs. The ability of the toolflow to support a broad range of models and devices is shown through a number of experiments on various 3D CNN and FPGA system pairs. Furthermore, the toolflow has produced high-performing results for 3D CNN models that have not been mapped to FPGAs before, demonstrating the potential of FPGA-based systems in this space. Overall, HARFLOW3D delivers competitive latency compared to a range of state-of-the-art hand-tuned approaches, achieving up to 5$\times$ better performance than some existing works.
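To make the idea of a resource-aware, latency-minimizing optimization engine more concrete, the following is a minimal sketch of simulated annealing over per-layer parallelism factors under a DSP budget. It is an illustration only and not the toolflow's actual implementation: the function names (`latency`, `dsp_usage`, `anneal`), the cost model (each layer takes `ops / parallelism` cycles and layers run one after another), the one-DSP-per-parallel-unit resource model, and the cooling schedule are all assumptions chosen for brevity.

```python
import math
import random


def latency(workloads, parallelism):
    """Toy latency model (assumption): layers execute one after another,
    and each layer takes ops / parallelism cycles."""
    return sum(w / p for w, p in zip(workloads, parallelism))


def dsp_usage(parallelism, dsp_per_unit=1):
    """Toy resource model (assumption): each parallel unit costs a fixed number of DSPs."""
    return sum(p * dsp_per_unit for p in parallelism)


def anneal(workloads, dsp_budget, steps=5000, t_start=1.0, t_end=1e-3, seed=0):
    """Search for per-layer parallelism factors that minimize the toy latency
    while keeping DSP usage within dsp_budget."""
    if dsp_budget < len(workloads):
        raise ValueError("budget too small for one unit per layer")
    rng = random.Random(seed)
    current = [1] * len(workloads)          # feasible starting point
    current_cost = latency(workloads, current)
    best, best_cost = list(current), current_cost

    for step in range(steps):
        # Geometric cooling from t_start down to t_end.
        temp = t_start * (t_end / t_start) ** (step / max(1, steps - 1))

        # Neighbour move: change one layer's parallelism by +/- 1.
        candidate = list(current)
        i = rng.randrange(len(candidate))
        candidate[i] = max(1, candidate[i] + rng.choice((-1, 1)))

        # Reject moves that violate the resource constraint.
        if dsp_usage(candidate) > dsp_budget:
            continue

        cand_cost = latency(workloads, candidate)
        delta = (cand_cost - current_cost) / current_cost  # relative change

        # Metropolis criterion: always accept improvements,
        # accept worse designs with probability exp(-delta / temp).
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            current, current_cost = candidate, cand_cost
            if current_cost < best_cost:
                best, best_cost = list(current), current_cost

    return best, best_cost


if __name__ == "__main__":
    ops = [1000, 4000, 2000]                # hypothetical per-layer workloads
    design, cycles = anneal(ops, dsp_budget=16)
    print("parallelism:", design, "latency (cycles):", cycles)
```

In this sketch the resource check acts as a hard constraint, so every accepted design fits the budget; a real flow would replace the two toy models with the performance and resource models listed in item ii) and would search a much richer set of design parameters.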
