An Optimizing Framework on MLIR for Efficient FPGA-based Accelerator Generation

Weichuang Zhang, Jieru Zhao, Guan Shen, Quan Chen, Chen Chen, Minyi Guo

TL;DR

This work tackles the difficulty of generating high-performance FPGA accelerators by introducing POM, an open-source optimization framework on MLIR that uses three layered intermediate representations (dependence graph IR, polyhedral IR, and annotated MLIR affine dialect) together with a declarative DSL and a design space exploration (DSE) engine. By integrating the polyhedral model for dependence analysis and loop transformations, POM enables FPGA-oriented optimizations that are difficult to achieve with single-IR approaches. The framework delivers substantial speedups over prior state-of-the-art tools (on average $6.46\times$ on typical benchmarks and $6.06\times$ on real-world workloads) and supports broad domains including image processing and deep learning, while maintaining efficient resource use. This approach improves productivity and scalability for FPGA accelerator development and is open-sourced for broader adoption.

Abstract

With the increasing demand for computing capability given limited resource and power budgets, it is crucial to deploy applications to customized accelerators like FPGAs. However, FPGA programming is non-trivial. Although existing high-level synthesis (HLS) tools improve productivity to a certain extent, they are limited in scope and capability to support sufficient FPGA-oriented optimizations. This paper focuses on FPGA-based accelerators and proposes POM, an optimizing framework built on multi-level intermediate representation (MLIR). POM has several features which demonstrate its scope and capability of performance optimization. First, most HLS tools depend exclusively on a single-level IR to perform all the optimizations, introducing excessive information into the IR and making debugging an arduous task. In contrast, POM introduces three layers of IR to perform operations at suitable abstraction levels, streamlining the implementation and debugging process and exhibiting better flexibility, extensibility, and systematicness. Second, POM integrates the polyhedral model into MLIR, enabling advanced dependence analysis and various FPGA-oriented loop transformations. By representing nested loops with integer sets and maps, loop transformations can be conducted conveniently through manipulations on polyhedral semantics. Finally, to further relieve design effort, POM has a user-friendly programming interface (DSL) that allows a concise description of computation and includes a rich collection of scheduling primitives. An automatic design space exploration (DSE) engine is provided to search for high-performance optimization schemes efficiently and generate optimized accelerators automatically. Experimental results show that POM achieves a $6.46\times$ average speedup on typical benchmark suites and a $6.06\times$ average speedup on real-world applications compared to the state-of-the-art.
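The abstract's remark that loops can be represented with integer sets and maps can be made concrete with a small sketch. The Python below is illustrative only and is not POM's DSL or its polyhedral IR: the function names (`iteration_domain`, `tile_map`, `tiled_schedule`) and the 4x4 domain with 2x2 tiles are assumptions chosen for the example. It models a two-level loop nest as the integer set {(i, j) : 0 <= i < n, 0 <= j < m} and expresses loop tiling as the map (i, j) -> (i / ti, j / tj, i mod ti, j mod tj); the tiled execution order is the original set sorted lexicographically by its image under that map.

```python
from itertools import product


def iteration_domain(n, m):
    """Integer set {(i, j) : 0 <= i < n, 0 <= j < m} in original loop order."""
    return list(product(range(n), range(m)))


def tile_map(point, ti, tj):
    """Tiling map (i, j) -> (i // ti, j // tj, i % ti, j % tj).

    The first two coordinates select a tile; the last two give the
    position inside that tile.
    """
    i, j = point
    return (i // ti, j // tj, i % ti, j % tj)


def tiled_schedule(n, m, ti, tj):
    """Execution order after tiling: sort points by their mapped coordinates."""
    return sorted(iteration_domain(n, m), key=lambda p: tile_map(p, ti, tj))


# Hypothetical sizes, chosen only for illustration.
original = iteration_domain(4, 4)
tiled = tiled_schedule(4, 4, 2, 2)
```

In this sketch the transformation never touches the loop body: it only reorders the same set of iteration points, which is why a legality check reduces to confirming that every dependence still points forward in the new order.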

Paper Structure (28 sections, 16 figures, 7 tables)

Figures (16)

  • Figure 1: Illustration of loop dependence analysis
  • Figure 2: Motivating example: (a) presents the code snippet of BICG; (b) compares latency and speedup achieved by different frameworks; (c)(d)(e) illustrate schedules for BICG generated by the baseline, ScaleHLS, and POM, correspondingly.
  • Figure 3: Framework overview
  • Figure 4: Matrix multiplication with POM DSL.
  • Figure 5: Loop tiling on the algorithm in Fig. 4.
  • ...and 11 more figures