Leveraging Application-Specific Knowledge for Energy-Efficient Deep Learning Accelerators on Resource-Constrained FPGAs
Chao Qian
TL;DR
This work addresses energy-efficient deep learning (DL) inference on resource-constrained FPGAs for IoT devices. It proposes a framework centered on a Generator that combines optimized RTL templates, workload-aware execution strategies, and application-specific knowledge. The Generator automates design-space exploration under resource and latency constraints, with energy efficiency as the primary objective. The methodology has three parts (inputs, Generator, evaluation) and is supported by software and real-hardware testing on the Elastic Node platform. Key contributions are the structured methodology, progress in RTL template optimization (e.g., LSTM latency and energy gains), workload-adaptive strategies for regular and irregular workloads, and a literature review that identifies critical gaps. Early results include a latency reduction from $53.32\,\mu\text{s}$ to $28.07\,\mu\text{s}$ (about 47%) and an energy-efficiency gain from $5.57$ to $12.98$ GOPS/W (about 2.3×), along with effective Idle-Waiting and adaptive-threshold strategies. These results indicate that automated, application-aware FPGA DL accelerators are feasible for energy-constrained IoT deployments.
Abstract
The growing adoption of Deep Learning (DL) applications in the Internet of Things (IoT) has increased the demand for energy-efficient accelerators. Field-Programmable Gate Arrays (FPGAs) offer a promising platform for such acceleration due to their flexibility and power efficiency. However, deploying DL models on resource-constrained FPGAs remains challenging because of limited on-chip resources, workload variability, and the need for energy-efficient operation. This paper presents a framework for generating energy-efficient DL accelerators on resource-constrained FPGAs. The framework systematically explores design configurations to improve energy efficiency while meeting requirements for resource utilization and inference performance across diverse application scenarios. The contributions of this work are: (1) an analysis of the challenges in achieving energy efficiency on resource-constrained FPGAs; (2) a methodology for designing DL accelerators that integrates Register Transfer Level (RTL) optimizations, workload-aware strategies, and application-specific knowledge; and (3) a literature review that identifies gaps and demonstrates the necessity of this work.
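
To make the idea of constraint-driven exploration concrete, the following is a minimal sketch of one possible selection step: discard configurations that violate a resource or latency budget, then pick the one with the lowest energy per inference. It is not the paper's Generator. The `Candidate` class, the `select_config` function, the field names, and all numeric values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Candidate:
    """One hypothetical accelerator configuration (all fields are illustrative)."""
    name: str
    luts: int           # estimated LUT usage
    latency_us: float   # estimated inference latency in microseconds
    power_mw: float     # estimated average power during inference in milliwatts

    @property
    def energy_uj(self) -> float:
        # mW * us = nJ, so divide by 1000 to get microjoules per inference.
        return self.power_mw * self.latency_us / 1000.0


def select_config(
    candidates: Iterable[Candidate],
    max_luts: int,
    max_latency_us: float,
) -> Optional[Candidate]:
    """Return the feasible candidate with the lowest energy per inference.

    A candidate is feasible if it fits the LUT budget and meets the latency
    bound. Returns None when no candidate is feasible.
    """
    feasible = [
        c for c in candidates
        if c.luts <= max_luts and c.latency_us <= max_latency_us
    ]
    return min(feasible, key=lambda c: c.energy_uj, default=None)


# Made-up estimates, not measurements from the paper.
CANDIDATES = [
    Candidate("serial", luts=1200, latency_us=50.0, power_mw=60.0),
    Candidate("pipelined", luts=2600, latency_us=30.0, power_mw=70.0),
    Candidate("fully_parallel", luts=9000, latency_us=12.0, power_mw=120.0),
]

best = select_config(CANDIDATES, max_luts=5000, max_latency_us=60.0)
print(best.name if best else "no feasible configuration")
```

In this toy setup the fully parallel design has the lowest energy per inference but exceeds the LUT budget, so the selection falls back to the pipelined design. A real exploration would replace the hand-written estimates with values from synthesis reports or hardware measurements.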
