Table-Lookup MAC: Scalable Processing of Quantised Neural Networks in FPGA Soft Logic

Daniel Gerlinghoff, Benjamin Chen Ming Choong, Rick Siow Mong Goh, Weng-Fai Wong, Tao Luo

TL;DR

This paper introduces Table Lookup Multiply-Accumulate (TLMAC), a framework that compiles and optimises quantised neural networks for scalable lookup-based processing, and demonstrates that TLMAC scales significantly better than previous related works.

Abstract

Recent advancements in neural network quantisation have yielded remarkable outcomes, with three-bit networks reaching state-of-the-art full-precision accuracy in complex tasks. These achievements present valuable opportunities for accelerating neural networks by computing in reduced precision. Implementing reduced-precision computation on FPGAs can take advantage of bit-level reconfigurability, which is not available on conventional CPUs and GPUs. Simultaneously, the high data intensity of neural network processing has inspired computing-in-memory paradigms, including on FPGA platforms. By programming the effects of trained model weights as lookup operations in soft logic, the transfer of weight data from memory units can be avoided, alleviating the memory bottleneck. However, previous methods scale poorly: their high logic utilisation limits them to small networks or sub-networks of binary models with low accuracy. In this paper, we introduce Table Lookup Multiply-Accumulate (TLMAC) as a framework to compile and optimise quantised neural networks for scalable lookup-based processing. TLMAC clusters and maps unique groups of weights to lookup-based processing elements, enabling highly parallel computation while taking advantage of parameter redundancy. In addition, place and route algorithms are proposed to reduce LUT utilisation and routing congestion. We demonstrate that TLMAC significantly improves on the scalability of previous related works. Our efficient logic mapping and high degree of reuse enable entire ImageNet-scale quantised models with full-precision accuracy to be implemented using lookup-based computing on one commercially available FPGA.
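
The idea of replacing a multiply-accumulate with a table lookup can be illustrated with a small, generic sketch. It is not the TLMAC processing element itself: the bit-serial scheme, the function names `build_lut` and `lut_mac`, and the example values are all assumptions chosen for illustration. For a fixed group of weights, the table stores the partial dot product for every binary input pattern, so the weights never have to be fetched at run time; only the activation bits are used as the lookup address.

```python
def build_lut(weights):
    """Precompute the dot product of `weights` with every binary input pattern.

    Entry `addr` holds the sum of the weights whose bit is set in `addr`.
    (Hypothetical helper; in hardware this would be the LUT initialisation.)
    """
    n = len(weights)
    return [
        sum(w for i, w in enumerate(weights) if (addr >> i) & 1)
        for addr in range(1 << n)
    ]


def lut_mac(lut, activations, act_bits):
    """Bit-serial MAC: one table lookup per activation bit plane.

    Assumes unsigned activations that fit in `act_bits` bits.
    """
    acc = 0
    for b in range(act_bits):
        # Gather bit b of every activation into a lookup address.
        addr = 0
        for i, a in enumerate(activations):
            addr |= ((a >> b) & 1) << i
        # Weight the looked-up partial sum by the bit significance.
        acc += lut[addr] << b
    return acc


weights = [3, -2, 1]       # one fixed weight group (illustrative values)
activations = [5, 7, 2]    # unsigned 3-bit activations (illustrative values)

lut = build_lut(weights)
result = lut_mac(lut, activations, act_bits=3)
reference = sum(w * a for w, a in zip(weights, activations))
assert result == reference
```

Because the table depends only on the weight group, two positions in the network that share the same group can share the same table, which is the kind of parameter redundancy the abstract refers to.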

Paper Structure (26 sections, 6 equations, 9 figures, 1 table, 1 algorithm)

This paper contains 26 sections, 6 equations, 9 figures, 1 table, 1 algorithm.

Figures (9)

  • Figure 1: Comparison of steps involved in obtaining FPGA soft logic for neural networks. While prior research [wang2019lutnet, wang2022logic] customises and constrains the training process to achieve LUT compatibility, TLMAC derives optimised LUT initialisations from state-of-the-art quantised models directly.
  • Figure 2: A window of $1 \times D_k$ values (with $D_k = 3$) from the input tensor is passed to the TLMAC PE along with the current step index 0 along the $D_s$ dimension. Using all 3 kernel rows in parallel, the PE produces partial sums in three rows of the output feature map, spanning 64 channels. While the first row is fully completed, the other two are pending accumulation with partial sums generated from values in the subsequent rows of the input feature map.
  • Figure 3: TLMAC processing element consisting of LUT pool, switches and accumulators. $N_{\text{arr}}$ LUT arrays consist of LUTs to generate $N_{\text{lut}}$ output bits. Each LUT array can store $N_{\text{clus}}$ weight groups. Hardwired routing connects LUT outputs to switches that select which MAC result is to be accumulated. The place & route algorithms target $N_{\text{arr}}$ and routing connections, respectively.
  • Figure 4: From the weight tensor with parallel and sequential dimensions, the unique weight groups are extracted. A binary assignment matrix $\mathbf{C}$ is derived that shows which of the unique weight groups are involved in every step along $D_s$. After clustering of $D_s$, the horizontal assignment of weights to their respective cluster is fixed. Simulated annealing determines the vertical assignment within $N_{\text{arr}}$ LUT arrays.
  • Figure 5: Lines show the number of unique weight groups in convolution layers within ResNet-18's basic blocks. The theoretical maximum number of weight groups, based on the weight bit width and kernel size, is given by the dashed horizontal lines. The result of the clustering is $N_{\text{arr}}$ LUT arrays per layer, represented by the bars.
  • ...and 4 more figures
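
The extraction step described in the captions of Figures 4 and 5 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the weight tensor is represented as a nested list `weights[s][p]` indexed by sequential step and parallel position, each entry is one weight group stored as a tuple, the matrix $\mathbf{C}$ is laid out as unique groups by steps, and the theoretical maximum is taken to be $(2^{B})^{G}$ for $B$-bit weights and groups of $G$ weights. The function names are hypothetical.

```python
def unique_groups(weights):
    """Collect the distinct weight groups and give each one an index.

    `weights[s][p]` is assumed to be a tuple holding one weight group at
    sequential step s and parallel position p.
    """
    groups = sorted({g for step in weights for g in step})
    index = {g: u for u, g in enumerate(groups)}
    return groups, index


def assignment_matrix(weights, index):
    """Binary matrix C with C[u][s] = 1 if unique group u is used in step s."""
    C = [[0] * len(weights) for _ in index]
    for s, step in enumerate(weights):
        for g in step:
            C[index[g]][s] = 1
    return C


def max_unique_groups(bits, group_size):
    """Upper bound on distinct groups, assuming `bits`-bit weights."""
    return (2 ** bits) ** group_size


# Two sequential steps, three parallel positions, groups of three weights
# (illustrative values only).
weights = [
    [(1, 0, -1), (1, 0, -1), (0, 0, 0)],
    [(0, 0, 0), (2, 1, 0), (2, 1, 0)],
]

groups, index = unique_groups(weights)
C = assignment_matrix(weights, index)
bound = max_unique_groups(bits=3, group_size=3)
```

In this toy input, six group slots collapse to three unique groups, which is far below the bound of 512 for three-bit weights in groups of three. That gap between observed and possible groups is what makes sharing lookup tables worthwhile.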