A DSP shared is a DSP earned: HLS Task-Level Multi-Pumping for High-Performance Low-Resource Designs

Giovanni Brignone, Mihai T. Lazarescu, Luciano Lavagno

TL;DR

The paper tackles the challenge of reducing FPGA DSP resource usage in high-level synthesis (HLS) designs without sacrificing kernel throughput. It introduces a task-level multi-pumping approach that generalizes dataflow graphs (DFGs) to multi-clock DFGs (MCDFGs) and leverages HLS resource sharing to reuse the same functional units across multiple operations while raising per-task clock frequencies. By optimizing per-task multi-pumping factors and synthesizing MCDFGs, the method opens a new Pareto front in the throughput-versus-DSP space, achieving up to 40–54% DSP savings at the same throughput and up to ~52% throughput gains with the same number of DSPs on open-source benchmarks. The workflow is validated on real FPGA platforms using state-of-the-art Xilinx tools, demonstrating practical impact and suggesting a path toward automated, high-level optimization passes that exploit multi-clock pipelines and per-task resource sharing.

Abstract

High-level synthesis (HLS) enhances digital hardware design productivity through a high abstraction level. Even if the HLS abstraction prevents fine-grained manual register-transfer level (RTL) optimizations, it also enables automatable optimizations that would be unfeasible or hard to automate at RTL. Specifically, we propose a task-level multi-pumping methodology to reduce resource utilization, particularly digital signal processors (DSPs), while preserving the throughput of HLS kernels modeled as dataflow graphs (DFGs) targeting field-programmable gate arrays. The methodology exploits the HLS resource sharing to automatically insert the logic for reusing the same functional unit for different operations. In addition, it relies on multi-clock DFGs to run the multi-pumped tasks at higher frequencies. The methodology scales the pipeline initiation interval (II) and the clock frequency constraints of resource-intensive tasks by a multi-pumping factor (M). The looser II allows sharing the same resource among M different operations, while the tighter clock frequency preserves the throughput. We verified that our methodology opens a new Pareto front in the throughput and resource space by applying it to open-source HLS designs using state-of-the-art commercial HLS and implementation tools by Xilinx. The multi-pumped designs require up to 40% fewer DSP resources at the same throughput as the original designs optimized for performance (i.e., running at the maximum clock frequency) and achieve up to 50% better throughput using the same DSPs as the original designs optimized for resources with a single clock.
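
The throughput argument in the abstract can be made explicit with a short worked equation. This is a sketch under idealized assumptions that the abstract does not state: a task clocked at frequency $f$ starts one pipeline iteration every $\mathit{II}$ cycles, and the $N_{\mathrm{op}}$ operations of a given type in each iteration are served by fully pipelined functional units. The symbols $T$, $T_M$, $N_{\mathrm{op}}$, and $N_{\mathrm{FU}}$ are introduced here for illustration only.

$$
T = \frac{f}{\mathit{II}}, \qquad
T_M = \frac{M f}{M\,\mathit{II}} = T, \qquad
N_{\mathrm{FU}}(M) = \left\lceil \frac{N_{\mathrm{op}}}{M\,\mathit{II}} \right\rceil
$$

Under these assumptions, scaling both constraints by $M$ leaves the iteration rate unchanged while dividing the number of functional units by up to $M$. The bound is reached only if the implementation tools can actually close timing at $M f$.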

Paper Structure (16 sections, 6 equations, 4 figures, 1 table)


Figures (4)

  • Figure 1: Task-level multi-pumping saves resources at equal throughput for HLS of DFGs. The Filter2D task from a 2D Convolution kernel [xilinx_conv] is double-pumped by doubling its clock frequency and II, saving half of the multipliers of the single-clock solution.
  • Figure 2: Given the C/C++ source code of a DFG application and its base clock frequency, the proposed workflow builds the optimized multi-pumped IP by analyzing the DFG (DFG charact.), optimizing the multi-pumping factors ($\mathcal{M}$ opt.), and synthesizing the multi-pumped IP (MCDFG synth.).
  • Figure 3: The pipeline II directly affects resource sharing. For example, in the Filter2D task, the pipeline with $\mathit{II} = 1$ cycle computes four multiplications per clock cycle in steady state, while the one with $\mathit{II} = 2$ cycles computes only two. Thus, the latter datapath allocates half of the multipliers.
  • Figure 4: DSPs allocated for a given throughput. The M-Pump designs are optimized using the proposed task-level multi-pumping technique. The M-Pump designs are Pareto-optimal compared to the Base designs, whose DSP utilization is constant since they are optimized by tuning the clock frequency only, and to the S-Pump designs, which are optimized for area by changing both the II and the global clock frequency of the tasks. The dashed lines represent the theoretical throughputs achievable with the allocated DSPs, which are unreachable in practice due to memory bandwidth limitations. The dots show the design points implemented in HW.
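
The resource-sharing effect described in the Figure 3 caption can be checked numerically. The following Python sketch is an idealized model and not the paper's tool flow: the function names, the ceiling-based multiplier count, and the 250 MHz base clock are all assumptions chosen for illustration.

```python
from math import ceil


def multipliers_needed(mults_per_iteration: int, ii: int) -> int:
    """Idealized lower bound on multipliers for a pipeline with the given II.

    Assumption: each multiplier is fully pipelined and accepts one operation
    per cycle, so one unit can serve up to `ii` multiplications per iteration.
    """
    return ceil(mults_per_iteration / ii)


def throughput_miter_per_s(clock_mhz: float, ii: int) -> float:
    """Millions of iterations per second: one iteration starts every `ii` cycles."""
    return clock_mhz / ii


def multi_pump(clock_mhz: float, ii: int, m: int) -> tuple[float, int]:
    """Scale both the clock frequency and the II by the multi-pumping factor m."""
    return clock_mhz * m, ii * m


# Filter2D example from the caption: four multiplications per iteration.
MULTS = 4
base_clock_mhz, base_ii = 250.0, 1  # 250 MHz is an assumed base clock
pumped_clock_mhz, pumped_ii = multi_pump(base_clock_mhz, base_ii, m=2)

# Single-clock baseline: multipliers and throughput.
print(multipliers_needed(MULTS, base_ii),
      throughput_miter_per_s(base_clock_mhz, base_ii))
# Double-pumped task: half the multipliers, same throughput.
print(multipliers_needed(MULTS, pumped_ii),
      throughput_miter_per_s(pumped_clock_mhz, pumped_ii))
```

In this model the double-pumped task needs two multipliers instead of four at the same iteration rate. A real design also has to meet timing at the doubled clock and pay for the clock-domain crossings between tasks, neither of which the sketch accounts for.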