A Parameterizable Convolution Accelerator for Embedded Deep Learning Applications
Panagiotis Mousouliotis, Georgios Keramidas
TL;DR
The paper tackles the challenge of deploying CNNs for embedded deep learning (DL) on resource-limited FPGA SoCs by introducing a parameterizable CNN accelerator template implemented via high-level synthesis (HLS). The design comprises a two-part architecture (CONV-PART and MPOOL-PART) with an 8-bit dynamic fixed-point quantization scheme and HW/SW co-design optimizations, enabling latency, power, and area trade-offs across a range of CNN workloads. Key contributions include the detailed accelerator architecture, the quantization strategy, HLS-driven parallelism, and an evaluation reporting favorable latency and power compared with related work. The work demonstrates a flexible, scalable approach for embedded DL that can be extended to other DL applications and workload families.
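As a rough illustration of what an 8-bit dynamic fixed-point scheme can look like, the sketch below picks a per-tensor fractional length from the largest absolute value and then rounds and saturates to signed 8-bit codes. This summary does not specify the paper's exact scheme, so the power-of-two scaling rule and the helper names (`choose_frac_bits`, `quantize`, `dequantize`) are assumptions made for illustration only.

```python
import math

def choose_frac_bits(values, total_bits=8):
    """Pick a fractional length so the largest magnitude fits (assumed rule)."""
    max_abs = max(abs(v) for v in values)
    if max_abs == 0:
        return total_bits - 1
    # Integer bits needed for the magnitude, plus one sign bit.
    int_bits = math.ceil(math.log2(max_abs)) + 1
    return total_bits - int_bits

def quantize(values, frac_bits, total_bits=8):
    """Scale by 2**frac_bits, round, and saturate to the signed range."""
    lo = -(1 << (total_bits - 1))
    hi = (1 << (total_bits - 1)) - 1
    scale = 2.0 ** frac_bits
    return [min(hi, max(lo, round(v * scale))) for v in values]

def dequantize(codes, frac_bits):
    """Map integer codes back to real values."""
    scale = 2.0 ** frac_bits
    return [c / scale for c in codes]

# Hypothetical weights for one layer; each layer would get its own frac_bits.
weights = [0.5, -0.25, 0.75]
fb = choose_frac_bits(weights)
codes = quantize(weights, fb)
restored = dequantize(codes, fb)
```

Because the fractional length is chosen per tensor rather than fixed globally, layers with small-magnitude weights keep more fractional precision, while layers with larger values trade precision for range.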
Abstract
Convolutional neural network (CNN) accelerators implemented on Field-Programmable Gate Arrays (FPGAs) are typically designed with a primary focus on maximizing performance, often measured in giga-operations per second (GOPS). However, real-life embedded deep learning (DL) applications impose multiple constraints related to latency, power consumption, area, and cost. This work presents a hardware-software (HW/SW) co-design methodology in which a CNN accelerator is described using high-level synthesis (HLS) tools that ease the parameterization of the design, facilitating more effective optimizations across multiple design constraints. Our experimental results demonstrate that the proposed design methodology is able to outperform non-parameterized design approaches, and it can be easily extended to other types of DL applications.
