LUTMUL: Exceed Conventional FPGA Roofline Limit by LUT-based Efficient Multiplication for Neural Network Inference

Yanyue Xie, Zhengang Li, Dana Diaconu, Suranga Handagala, Miriam Leeser, Xue Lin

TL;DR

LUTMUL harnesses look-up tables (LUTs) to perform multiplications in FPGA-based neural network accelerators with a reconfigurable dataflow architecture, setting a new benchmark for efficient neural network inference on FPGAs.

Abstract

For FPGA-based neural network accelerators, digital signal processing (DSP) blocks have traditionally been the cornerstone for handling multiplications. This paper introduces LUTMUL, which harnesses the potential of look-up tables (LUTs) for performing multiplications. LUTs typically outnumber DSPs by a factor of 100, offering a significant computational advantage. By exploiting this advantage, our method demonstrates a potential boost in the performance of FPGA-based neural network accelerators with a reconfigurable dataflow architecture. Our approach challenges the conventional peak performance of DSP-based accelerators and sets a new benchmark for efficient neural network inference on FPGAs. Experimental results demonstrate that our design achieves the best inference speed among all FPGA-based accelerators, with a throughput of 1627 images per second and a top-1 accuracy of 70.95% on the ImageNet dataset.

Paper Structure

This paper contains 18 sections, 5 equations, 6 figures, 2 tables, 1 algorithm.

Figures (6)

  • Figure 1: Roofline model analysis for LUTMUL and other DSP-based architectures. We take $\frac{1}{64}$ of the U280's resources and memory bandwidth for the analysis.
  • Figure 2: Accuracy loss and LUT resources for 1-bit to 8-bit quantization.
  • Figure 3: LUTMUL design flow.
  • Figure 4: Hardware architecture of the accelerator generated by LUTMUL. Our design is fully on-chip and does not use DRAM or HBM.
  • Figure 5: Illustration of LUTMUL for efficient multiplication via look-up tables. The left-hand figure shows how the LUT6_2 primitive is used to embed the multiplication results of weights and input activations. The right-hand table shows the multiplication results for two example weights and how the corresponding look-up table contents are generated. The weights (int4) and the multiplication output (int8) use two's complement representation, while the activations are all unsigned (uint4). The Most Significant Bit (MSB) of the LUT6_2 input is tied to '1' to enable both output ports. The bit below the MSB is a Weight Select (WS) signal that selects between the two weights. The lowest four input bits carry the activation. Our method embeds two int4 weights inside four LUT6s, a resource-efficient approach in contrast to general multipliers instantiated from LUT6s, which consume 6-14$\times$ more LUT6 resources. The two example weights are 1 and -3. The embedded contents of the four LUTs are 64'hfffe_0000_fffe_0000, 64'h07fe_0000_f83e_0000, 64'h39c6_ff00_5a5a_f0f0, and 64'hcccc_cccc_aaaa_aaaa, respectively.
  • ...and 1 more figure
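
The look-up table contents quoted in the Figure 5 caption can be reproduced with a short script. The following is a minimal Python sketch, not the authors' generator. The function names `lut6_2_inits` and `lut_multiply` are hypothetical. Two details are assumptions inferred from the constants in the caption: WS = 0 selects the first weight, and the first LUT holds product bits 7 and 6 (on outputs O6 and O5), the next holds bits 5 and 4, and so on down to bits 1 and 0.

```python
def lut6_2_inits(w0, w1):
    """Build four 64-bit LUT6_2 INIT values for two int4 weights.

    Assumed layout (inferred from the Figure 5 caption):
      address bit 5 is tied to 1, bit 4 is Weight Select (0 -> w0, 1 -> w1),
      bits 3..0 are the uint4 activation.
      LUT k (k = 0..3) drives product bit 7-2k on O6 and bit 6-2k on O5.
    """
    def bit_pattern(weight, bit):
        # 16-entry truth table: one product bit for every activation value.
        pattern = 0
        for a in range(16):
            product = (weight * a) & 0xFF  # int8 two's complement
            pattern |= ((product >> bit) & 1) << a
        return pattern

    inits = []
    for k in range(4):
        hi_bit, lo_bit = 7 - 2 * k, 6 - 2 * k
        # O6 reads the upper 32 INIT bits when address bit 5 is 1.
        o6 = (bit_pattern(w1, hi_bit) << 16) | bit_pattern(w0, hi_bit)
        # O5 always reads the lower 32 INIT bits.
        o5 = (bit_pattern(w1, lo_bit) << 16) | bit_pattern(w0, lo_bit)
        inits.append((o6 << 32) | o5)
    return inits


def lut_multiply(inits, ws, a):
    """Evaluate the four LUTs for one (weight select, activation) pair."""
    addr = (ws << 4) | a
    out = 0
    for k, init in enumerate(inits):
        o6 = (init >> (32 + addr)) & 1
        o5 = (init >> addr) & 1
        out |= (o6 << (7 - 2 * k)) | (o5 << (6 - 2 * k))
    # Reinterpret the 8-bit result as a signed int8.
    return out - 256 if out & 0x80 else out


if __name__ == "__main__":
    for init in lut6_2_inits(1, -3):
        print(f"64'h{init:016x}")
```

Under these assumptions, the weights 1 and -3 yield the four constants listed in the caption, and reading the tables back returns the product of the selected weight and the activation for every uint4 input.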