TAPA: A Scalable Task-Parallel Dataflow Programming Framework for Modern FPGAs with Co-Optimization of HLS and Physical Design

Licheng Guo, Yuze Chi, Jason Lau, Linghao Song, Xingyu Tian, Moazin Khatti, Weikang Qiao, Jie Wang, Ecenur Ustun, Zhenman Fang, Zhiru Zhang, Jason Cong

TL;DR

This article proposes TAPA, an end-to-end framework that compiles a C++ task-parallel dataflow program into a high-frequency FPGA accelerator and adopts a coarse-grained floorplanning step during HLS compilation for accurate pipelining of potential critical paths.

Abstract

In this paper, we propose TAPA, an end-to-end framework that compiles a C++ task-parallel dataflow program into a high-frequency FPGA accelerator. Compared to existing solutions, TAPA has two major advantages. First, TAPA provides a set of convenient APIs that allow users to easily express flexible and complex inter-task communication structures. Second, TAPA adopts a coarse-grained floorplanning step during HLS compilation for accurate pipelining of potential critical paths. In addition, TAPA implements several optimization techniques specifically tailored for modern HBM-based FPGAs. In our experiments with a total of 43 designs, we improve the average frequency from 147 MHz to 297 MHz (a 102% improvement) with no loss of throughput and a negligible change in resource utilization. Notably, in 16 experiments we make the originally unroutable designs achieve 274 MHz on average. The framework is available at https://github.com/UCLA-VAST/tapa and the core floorplan module is available at https://github.com/UCLA-VAST/AutoBridge.

Paper Structure

This paper contains 38 sections, 11 equations, 19 figures, 11 tables.

Figures (19)

  • Figure 1: An overview of our TAPA framework. The input is a task-parallel dataflow program written in C/C++ with the TAPA APIs. We first invoke the TAPA compiler to extract the parallel tasks, then synthesize each task using Vitis HLS to obtain its RTL representation and an estimated area. Next, the AutoBridge module of TAPA [fpga21-autobridge] floorplans the program and determines a target region for each task. Based on the floorplan, we intelligently compute the pipeline stages of the communication logic between tasks and ensure that throughput will not degrade. TAPA generates the actual RTL of the pipeline logic that composes the tasks together. A constraint file is also generated to pass the floorplan information to the downstream tools.
  • Figure 2: Block diagrams of three representative FPGA architectures: the Xilinx Alveo U250, U280 (based on the Xilinx UltraScale+ architecture), and the Intel Stratix 10.
  • Figure 3: Implementation results of a CNN accelerator on the Xilinx U250 FPGA. Spreading the tasks across the device helps reduce local congestion, while the die-crossing wires are additionally pipelined.
  • Figure 4: Implementation results of a stencil computing design on U280. Floorplanning during HLS compilation significantly benefits the physical design tools.
  • Figure 5: Accelerator task instantiation in TAPA.
  • ...and 14 more figures
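
The Figure 1 caption says that pipeline stages for inter-task communication are derived from the floorplan. The C++ sketch below illustrates one plausible way to do that, not the method used by TAPA or AutoBridge: it assumes the device is divided into a grid of regions and inserts a fixed number of register stages for every region boundary a connection crosses. The `Slot` type, the `PipelineStages` function, and the `stages_per_crossing` parameter are hypothetical names introduced only for this illustration.

```cpp
#include <cassert>
#include <cstdlib>

// Hypothetical coordinate of a floorplan region on a coarse grid.
// This is an illustrative type, not part of the TAPA API.
struct Slot {
  int col;  // column index of the region
  int row;  // row index of the region (for example, one row per die)
};

// Assumption: each region boundary crossed by a connection receives
// `stages_per_crossing` register stages. The number of boundaries is
// taken as the Manhattan distance between the two regions.
int PipelineStages(Slot src, Slot dst, int stages_per_crossing = 1) {
  assert(stages_per_crossing >= 0);
  const int crossings =
      std::abs(src.col - dst.col) + std::abs(src.row - dst.row);
  return crossings * stages_per_crossing;
}
```

Under these assumptions, two tasks placed in the same region need no extra stages, while a connection from region (0, 0) to region (1, 3) crosses four boundaries and receives four stages at one stage per crossing. A real flow would also have to keep throughput intact when paths with different added latency reconverge, which this sketch does not attempt.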