SILVIA: Automated Superword-Level Parallelism Exploitation via HLS-Specific LLVM Passes for Compute-Intensive FPGA Accelerators

Giovanni Brignone, Roberto Bosio, Fabrizio Ottati, Claudio Sansoè, Luciano Lavagno

TL;DR

SILVIA is an open-source LLVM transformation pass that automatically identifies superword-level parallelism within an HLS design and exploits it by packing multiple operations, such as additions, multiplications, and multiply-and-adds, into a single DSP.

Abstract

High-level synthesis (HLS) aims at democratizing custom hardware acceleration with highly abstracted software-like descriptions. However, efficient accelerators still require substantial low-level hardware optimizations, defeating the HLS intent. In the context of field-programmable gate arrays, digital signal processors (DSPs) are a crucial resource that typically requires a significant optimization effort for its efficient utilization, especially when used for sub-word vectorization. This work proposes SILVIA, an open-source LLVM transformation pass that automatically identifies superword-level parallelism within an HLS design and exploits it by packing multiple operations, such as additions, multiplications, and multiply-and-adds, into a single DSP. SILVIA is integrated in the flow of the commercial AMD Vitis HLS tool and proves its effectiveness by packing multiple operations on the DSPs without any manual source-code modifications on several diverse state-of-the-art HLS designs such as convolutional neural networks and basic linear algebra subprograms accelerators, reducing the DSP utilization for additions by 70 % and for multiplications and multiply-and-adds by 50 % on average.
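The packing idea can be illustrated with a toy software model. The sketch below is not SILVIA's bit mapping and makes no claim about the operand widths of any particular DSP: it assumes four 4-bit unsigned factors that share one common 4-bit unsigned factor, and it spaces the factors 8 bits apart so that each 8-bit product stays inside its own field. The names `FIELD_BITS`, `pack_factors`, and `packed_multiply` are hypothetical. A signed common factor would need an extra correction step, which is not shown here.

```python
FIELD_BITS = 8                      # a 4-bit x 4-bit unsigned product fits in 8 bits
FIELD_MASK = (1 << FIELD_BITS) - 1  # mask selecting one product field


def pack_factors(factors):
    """Place each 4-bit unsigned factor in its own 8-bit field of one wide word."""
    word = 0
    for i, a in enumerate(factors):
        if not 0 <= a < 16:
            raise ValueError("factors must be 4-bit unsigned values")
        word |= a << (i * FIELD_BITS)
    return word


def packed_multiply(factors, common):
    """Return [a * common for a in factors] using a single wide multiplication."""
    if not 0 <= common < 16:
        raise ValueError("common factor must be a 4-bit unsigned value")
    # One wide multiply: sum(a_i << 8i) * b == sum((a_i * b) << 8i).
    wide_product = pack_factors(factors) * common
    # Each product is at most 15 * 15 = 225 < 256, so fields never overlap.
    return [(wide_product >> (i * FIELD_BITS)) & FIELD_MASK
            for i in range(len(factors))]
```

For instance, `packed_multiply([3, 7, 15, 0], 15)` returns `[45, 105, 225, 0]`, the same values as four separate multiplications.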

Paper Structure

This paper contains 19 sections, 5 equations, 7 figures, 2 tables, 1 algorithm.

Figures (7)

  • Figure 1: The modified HLS workflow with SILVIA. SILVIA optimizes the original LLVM IR generated by the front end (FE) for DSP-packed operations and provides it as input to the back end (BE).
  • Figure 2: The bit mapping of the proposed method for computing four multiplications between four 4-bit unsigned factors and one common 4-bit factor (signed or unsigned).
  • Figure 3: The C source code compiles to LLVM code in which the two mul instructions (i.e., c0 and c1) are incompatible for vectorization, since c0 is used before the definition of c1. SILVIA rearranges the code to make c0 and c1 compatible by moving the uses of c0 as late as possible (ALAP) while preserving the functionality. Finally, SILVIA replaces the two mul instructions with a function call to the corresponding DSP-packed implementation.
  • Figure 4: Example of an edge-case design where packing multiple operations to the same DSP is detrimental to the initiation interval (II) of the pipeline. In the data dependence graph (DDG) of the original source code, the nodes are the instructions and the edges are the data dependencies between the instructions, labeled with their distance. This DDG has a critical cycle (highlighted in dark blue) which determines a minimum II of 2 clock cycles (i.e., the maximum ceiled ratio between the total latency and the total distance along any cycle in the DDG), assuming a latency of 1 clock cycle for each operation. Packing a and b to the same DSP introduces a new critical cycle in the DDG that increases the minimum II to 3 clock cycles.
  • Figure 5: Modifications to the Vitis HLS synthesis script for executing the optimized SILVIA flow. Users just need to update the Vitis HLS synthesis script by specifying which SILVIA passes to run, via the SILVIA::PASSES list, and running the custom SILVIA::csynth_design command.
  • ...and 2 more figures
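
The minimum initiation interval rule quoted in the Figure 4 caption (the maximum, over all dependence cycles, of the ceiled ratio between total latency and total distance) can be sketched as follows. The function name and the two example cycle lists are hypothetical: they are chosen only to reproduce the values 2 and 3 mentioned in the caption, and they do not reproduce the actual graph of Figure 4. Every edge is written as a `(latency, distance)` pair.

```python
def min_initiation_interval(cycles):
    """Return max over cycles of ceil(total latency / total distance).

    cycles: iterable of dependence cycles, each a list of (latency, distance)
    pairs, one pair per edge along the cycle.
    """
    best = 1  # a pipeline cannot start iterations more often than every cycle
    for cycle in cycles:
        total_latency = sum(latency for latency, _ in cycle)
        total_distance = sum(distance for _, distance in cycle)
        if total_distance <= 0:
            raise ValueError("a dependence cycle needs a positive total distance")
        # Integer ceiling division, avoiding floating-point rounding.
        best = max(best, -(-total_latency // total_distance))
    return best


# Hypothetical cycles, assuming a latency of 1 clock cycle per operation.
original_cycles = [[(1, 0), (1, 1)]]                         # latency 2, distance 1
packed_cycles = original_cycles + [[(1, 0), (1, 0), (1, 1)]]  # adds latency 3, distance 1

print(min_initiation_interval(original_cycles))
print(min_initiation_interval(packed_cycles))
```

With these assumed cycles, the first call yields 2 and the second yields 3, mirroring how an additional, longer cycle created by packing can raise the minimum II.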