MetaML-Pro: Cross-Stage Design Flow Automation for Efficient Deep Learning Acceleration
Zhiqiang Que, Jose G. F. Coutinho, Ce Guo, Hongxiang Fan, Wayne Luk
TL;DR
This work addresses the challenge of deploying high-accuracy DNNs on resource-constrained FPGA hardware by introducing MetaML-Pro, a cross-stage co-optimization framework that unifies software (DNN) optimization with high-level synthesis (HLS) design through reusable pipe tasks and a metamodel. It uses metaprogramming to transform HLS C++ code and Bayesian optimization to automatically explore a multi-abstraction design space, guiding both top-down and bottom-up optimization flows. The key contributions are a modular pipe-task library (e.g., PRUNING, SCALING, QUANTIZATION), a metaprogramming-enabled HLS optimization loop, and a Bayesian design-space exploration strategy. Together they reduce resource usage by up to $92\%$ for DSPs and $89\%$ for LUTs while maintaining accuracy, and the search completes $15.6\times$ faster than grid search. The approach enables automated, customizable generation of resource-efficient FPGA-based DNN accelerators and lays the groundwork for extending the flow to RTL and to broader DNN families.
Abstract
This paper presents a unified framework for codifying and automating optimization strategies to efficiently deploy deep neural networks (DNNs) on resource-constrained hardware, such as FPGAs, while maintaining high performance, accuracy, and resource efficiency. Deploying DNNs on such platforms requires balancing performance, resource usage (e.g., DSPs and LUTs), and inference accuracy, which often demands extensive manual effort and domain expertise. Our approach addresses two key issues: (i) encoding custom optimization strategies and (ii) enabling cross-stage optimization search. In particular, the proposed framework integrates programmatic DNN optimization techniques with high-level synthesis (HLS)-based metaprogramming and uses design space exploration (DSE) strategies such as Bayesian optimization to automate both top-down and bottom-up design flows, reducing the need for manual intervention and domain expertise. In addition, the framework introduces customizable optimization, transformation, and control blocks to enhance DNN accelerator performance and resource efficiency. Experimental results demonstrate up to a 92\% reduction in DSP usage and an 89\% reduction in LUT usage for select networks while preserving accuracy, along with a 15.6-fold reduction in optimization time compared to grid search. These results highlight the potential for automating the generation of resource-efficient DNN accelerator designs with minimal effort.
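To make the idea of reusable pipe tasks composed into a searchable flow more concrete, the following is a minimal Python sketch and not the framework's actual API. The `Design` record, the `pruning`, `scaling`, and `quantization` factories, and both toy models (`toy_cost`, `toy_accuracy_loss`) are hypothetical names and made-up formulas introduced only for illustration. The search shown is a plain exhaustive grid, that is, the baseline the paper compares against; a Bayesian optimizer would replace the loop in `grid_search` by proposing promising parameter combinations instead of enumerating all of them.

```python
from dataclasses import dataclass, replace
from itertools import product
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class Design:
    """Toy stand-in for a DNN design point; every field is hypothetical."""
    pruning_rate: float = 0.0  # fraction of weights removed
    scale: float = 1.0         # layer-width scaling factor
    bits: int = 16             # fixed-point word length


# A pipe task maps one design point to another.
PipeTask = Callable[[Design], Design]


def pruning(rate: float) -> PipeTask:
    return lambda d: replace(d, pruning_rate=rate)


def scaling(factor: float) -> PipeTask:
    return lambda d: replace(d, scale=factor)


def quantization(bits: int) -> PipeTask:
    return lambda d: replace(d, bits=bits)


def run_flow(design: Design, tasks: Iterable[PipeTask]) -> Design:
    """Apply the pipe tasks in order, as one pass of a design flow."""
    for task in tasks:
        design = task(design)
    return design


def toy_cost(d: Design) -> float:
    """Made-up relative resource cost (1.0 = unoptimized baseline)."""
    return (1.0 - d.pruning_rate) * d.scale * d.bits / 16


def toy_accuracy_loss(d: Design) -> float:
    """Made-up accuracy penalty; a real flow would retrain and evaluate."""
    return 0.05 * d.pruning_rate + 0.04 * (1.0 - d.scale) + 0.002 * (16 - d.bits)


def grid_search(max_loss: float) -> Optional[Design]:
    """Exhaustively try every combination; keep the cheapest feasible one."""
    best = None
    for rate, factor, bits in product((0.0, 0.5, 0.9), (1.0, 0.5), (16, 8)):
        d = run_flow(Design(), [pruning(rate), scaling(factor), quantization(bits)])
        if toy_accuracy_loss(d) > max_loss:
            continue  # violates the accuracy constraint
        if best is None or toy_cost(d) < toy_cost(best):
            best = d
    return best


if __name__ == "__main__":
    print(grid_search(max_loss=0.05))
```

Because each task is an independent function from design to design, tasks can be reordered, dropped, or parameterized without touching the search loop, which is the property that lets a single optimizer drive several abstraction levels.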
