MetaML-Pro: Cross-Stage Design Flow Automation for Efficient Deep Learning Acceleration

Zhiqiang Que, Jose G. F. Coutinho, Ce Guo, Hongxiang Fan, Wayne Luk

TL;DR

This work addresses the challenge of deploying high-accuracy DNNs on resource-constrained FPGA hardware by introducing MetaML-Pro, a cross-stage co-optimization framework that unifies software (DNN) optimization with high-level synthesis (HLS) design through reusable pipe tasks and a metamodel. It leverages metaprogramming to transform HLS C++ code and Bayesian optimization to automatically explore a multi-abstraction design space, guiding top-down and bottom-up optimization flows. The key contributions include a modular pipe-task library (e.g., PRUNING, SCALING, QUANTIZATION), a metaprogramming-enabled HLS optimization loop, and a Bayesian design-space exploration strategy that significantly reduces resource usage (up to $92\%$ DSP and $89\%$ LUT) while maintaining accuracy, with a reported $15.6\times$ speedup over grid search. This approach enables automated, customizable generation of resource-efficient FPGA-based DNN accelerators and lays the groundwork for extending to RTL and broader DNN families in practical deployment scenarios.

Abstract

This paper presents a unified framework for codifying and automating optimization strategies to efficiently deploy deep neural networks (DNNs) on resource-constrained hardware, such as FPGAs, while maintaining high performance, accuracy, and resource efficiency. Deploying DNNs on such platforms requires balancing performance, resource usage (e.g., DSPs and LUTs), and inference accuracy, which often demands extensive manual effort and domain expertise. Our approach addresses two key issues: (i) encoding custom optimization strategies and (ii) enabling cross-stage optimization search. In particular, our proposed framework integrates programmatic DNN optimization techniques with high-level synthesis (HLS)-based metaprogramming, leveraging design space exploration (DSE) strategies such as Bayesian optimization to automate both top-down and bottom-up design flows. This reduces the need for manual intervention and domain expertise. In addition, the framework introduces customizable optimization, transformation, and control blocks to enhance DNN accelerator performance and resource efficiency. Experimental results demonstrate up to a 92% reduction in DSP usage and an 89% reduction in LUT usage for select networks, while preserving accuracy, along with a 15.6-fold reduction in optimization time compared to grid search. These results highlight the potential for automating the generation of resource-efficient DNN accelerator designs with minimal effort.

Paper Structure

This paper contains 29 sections, 3 equations, 20 figures, 4 tables.

Figures (20)

  • Figure 1: The proposed approach.
  • Figure 2: A typical FPGA design flow. This FPGA design flow begins with defining the specifications (SPEC), followed by implementing and testing the design in software (SW). If High-Level Synthesis (HLS) is used, high-level code (e.g., C/C++) is converted into hardware description language (HDL). The design then moves to the Register Transfer Level (RTL) stage, where it is described in HDL (e.g., Verilog/VHDL) and synthesized into a netlist of logic gates. Finally, the netlist is used to generate a bitstream, which is loaded onto the FPGA to configure the hardware. Each stage progressively refines the design from concept to implementation.
  • Figure 3: A connection between an $O$-task and a $K$-task. A pipe task has a uniform interface, allowing any two pipe tasks to be connected (although there may be constraints on how many connections a task can handle). An $O$-task typically enhances DNN models based on specific objectives and constraints. A $K$-task, on the other hand, manages the control flow. Each connection defines a unidirectional stream between a source task and a target task.
  • Figure 4: This figure illustrates the implementation of our co-optimization framework, featuring an organized system of optimization spaces, each autonomously running Python programs within dedicated environments, overseen by an exploration space executing a general optimization strategy. The diagram depicts the orchestration of local optimization spaces, such as software and hardware, through a controller process.
  • Figure 5: The proposed QHS algorithm.
  • ...and 15 more figures
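
The pipe-task connection described in the Figure 3 caption can be illustrated with a minimal Python sketch. This is not the framework's actual API: the class names (`PipeTask`, `OTask`, `KTask`), the `connect` and `run` methods, the connection limits, and the toy model dictionary with its numbers are all assumptions made for illustration. It shows only the ideas stated in the caption: a uniform interface, unidirectional connections from a source task to a target task, an $O$-task that transforms the model, and a $K$-task that decides where the flow goes next.

```python
class PipeTask:
    """Hypothetical uniform interface shared by all pipe tasks."""
    max_targets = 1  # assumed limit on outgoing connections

    def __init__(self, name):
        self.name = name
        self.targets = []

    def connect(self, target):
        # A connection is a one-way stream: self is the source, target the sink.
        if len(self.targets) >= self.max_targets:
            raise ValueError(f"{self.name} cannot handle more connections")
        self.targets.append(target)
        return target

    def _stamp(self, model):
        # Return a copy of the model with this task recorded in its trace.
        return {**model, "trace": model["trace"] + [self.name]}

    def run(self, model):
        raise NotImplementedError


class OTask(PipeTask):
    """Optimization task: applies a transform, then forwards the model."""

    def __init__(self, name, transform):
        super().__init__(name)
        self.transform = transform

    def run(self, model):
        out = self.transform(self._stamp(model))
        return self.targets[0].run(out) if self.targets else out


class KTask(PipeTask):
    """Control task: picks the next task based on a predicate."""
    max_targets = 2  # assumed: one target per branch outcome

    def __init__(self, name, predicate):
        super().__init__(name)
        self.predicate = predicate

    def run(self, model):
        out = self._stamp(model)
        index = 0 if self.predicate(out) else 1
        # If the chosen branch has no target, the flow ends here.
        return self.targets[index].run(out) if index < len(self.targets) else out


# Toy transforms with made-up effects on a made-up model record.
def prune(m):
    return {**m, "dsp": m["dsp"] // 2, "accuracy_pct": m["accuracy_pct"] - 1}

def quantize(m):
    return {**m, "dsp": m["dsp"] // 2, "accuracy_pct": m["accuracy_pct"] - 1}

prune_task = OTask("prune", prune)
check_task = KTask("check", lambda m: m["accuracy_pct"] >= 75)
quantize_task = OTask("quantize", quantize)

prune_task.connect(check_task)     # O-task -> K-task, as in Figure 3
check_task.connect(quantize_task)  # taken only if accuracy is still acceptable

result = prune_task.run({"accuracy_pct": 76, "dsp": 1000, "trace": []})
print(result)
```

Because every task exposes the same `connect` and `run` methods, the tasks can be rearranged without changing their code. In this sketch the connection limit is a class attribute, so a control task can fan out to two branches while an optimization task forwards to a single successor.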