
PolyLUT-Add: FPGA-based LUT Inference with Wide Inputs

Binglei Lou, Richard Rademacher, David Boland, Philip H. W. Leong

TL;DR

PolyLUT-Add tackles the LUT scalability bottleneck in FPGA-based LUT networks by aggregating $A$ PolyLUT sub-neurons with an adder, which boosts fan-in without prohibitive LUT growth. The approach delivers accuracy gains of up to $2.7\%$ at the cost of 2–3× larger LUTs and, for similar accuracy, reduces LUT usage by $2.0\times$ to $13.9\times$ and latency by $1.2\times$ to $1.6\times$ on the MNIST, Jet Substructure, and UNSW-NB15 benchmarks. Training is performed offline with quantization-aware methods, and the hardware realization uses a two-stage pipelining strategy to balance latency and throughput. Overall, PolyLUT-Add offers a practical path to high-accuracy, ultra-low-latency edge inference with LUT-based DNNs on FPGAs, and its open-source tooling supports reproducibility.

Abstract

FPGAs have distinct advantages as a technology for deploying deep neural networks (DNNs) at the edge. Lookup Table (LUT)-based networks, in which neurons are modeled directly as LUTs, help realize this promise by offering ultra-low latency and high area efficiency on FPGAs. Unfortunately, LUT resource usage scales exponentially with the number of inputs to the LUT, which restricts prior LUT-based networks such as PolyLUT to small LUT sizes. This work introduces PolyLUT-Add, a technique that improves accuracy by enhancing neuron connectivity: it combines $A$ PolyLUT sub-neurons via addition. Moreover, we describe a novel architecture to improve its scalability. We evaluated our implementation on the MNIST, Jet Substructure classification, and Network Intrusion Detection benchmarks and found that, for similar accuracy, PolyLUT-Add achieves a $2.0\text{--}13.9\times$ reduction in LUTs with a $1.2\text{--}1.6\times$ decrease in latency.
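
To make the exponential scaling concrete, here is a minimal sketch that turns a neuron into a truth table by enumerating every combination of $F$ inputs of $\beta$ bits each, giving $2^{F\beta}$ entries. The neuron itself (`toy_neuron`, with integer weights and a 1-bit step activation) is a hypothetical stand-in, not the PolyLUT training or polynomial model.

```python
import itertools

def build_truth_table(neuron_fn, fan_in, beta):
    """Enumerate every combination of `fan_in` inputs, each an unsigned
    `beta`-bit code, and record the neuron output for it."""
    levels = range(2 ** beta)
    return {codes: neuron_fn(codes) for codes in itertools.product(levels, repeat=fan_in)}

def toy_neuron(codes, weights=(1, -2, 3, 1), bias=-2):
    # Hypothetical neuron: integer weighted sum followed by a 1-bit step activation.
    acc = sum(w * c for w, c in zip(weights, codes)) + bias
    return 1 if acc > 0 else 0

table = build_truth_table(toy_neuron, fan_in=4, beta=2)
print(len(table))  # 256 == 2 ** (4 * 2)

# Table size alone, for growing fan-in at beta = 2 bits per input.
for fan_in in (2, 4, 6, 8):
    print(fan_in, 2 ** (fan_in * 2))  # prints 2 16, 4 256, 6 4096, 8 65536 (one pair per line)
```

At inference time the neuron reduces to a single table read, `table[codes]`; the price is that each additional input multiplies the table size by $2^{\beta}$.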

Paper Structure

This paper contains 13 sections, 2 equations, 6 figures, and 5 tables.

Figures (6)

  • Figure 1: Architecture of (a) a PolyLUT neuron and (b) a PolyLUT-Add neuron. The fan-in $F$ is 6 (each input is a $\beta$-bit word), and the number of sub-neurons $A$ is 2. For simplicity, the polynomial order of each PolyLUT neuron [polylut] is set to 1 in this example. For each output bit, PolyLUT requires $2^{6\beta}$ lookup table entries, while PolyLUT-Add requires $2^{3\beta} + 2^{3\beta} + 2^{2(\beta+1)}$.
  • Figure 2: Illustration of the LUT-based DNN inference scheme used in LogicNets [LogicNets] and PolyLUT [polylut].
  • Figure 3: A single-layer block diagram of PolyLUT-Add.
  • Figure 4: Tool flow for PolyLUT-Add. Components of the original open-source PolyLUT toolflow [polylut] are shown in black, and modified elements in red.
  • Figure 5: Two synthesis strategies
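
The entry counts in Figure 1 can be checked with a small functional model. This sketch assumes, as a generalization of the Figure 1 example rather than a formula stated here, that $F$ is divisible by $A$, that each sub-neuron sees $F/A$ inputs, and that each sub-neuron emits a $(\beta+1)$-bit code. Under those assumptions a PolyLUT-Add neuron needs $A \cdot 2^{(F/A)\beta} + 2^{A(\beta+1)}$ entries per output bit, versus $2^{F\beta}$ for PolyLUT. The weights, clamping, and adder requantization (`WEIGHTS`, `sub_neuron`, `adder`) are hypothetical degree-1 stand-ins, not trained polynomials.

```python
import itertools

BETA, F, A = 2, 6, 2                 # Figure 1 example: fan-in 6, two sub-neurons
SUB_BITS = BETA + 1                  # assumed sub-neuron output width
WEIGHTS = [(1, -1, 2), (2, 1, -1)]   # hypothetical degree-1 sub-neuron weights

def entries_polylut(fan_in, beta):
    # One table indexed by all fan_in * beta input bits.
    return 2 ** (fan_in * beta)

def entries_polylut_add(fan_in, beta, a):
    # a sub-neuron tables with fan_in / a inputs each, plus one adder table
    # indexed by a sub-neuron outputs of (beta + 1) bits each.
    assert fan_in % a == 0
    return a * 2 ** ((fan_in // a) * beta) + 2 ** (a * (beta + 1))

def tabulate(fn, n_inputs, bits):
    # Truth table over every combination of n_inputs unsigned `bits`-bit codes.
    return {c: fn(c) for c in itertools.product(range(2 ** bits), repeat=n_inputs)}

def sub_neuron(codes, weights):
    # Weighted sum clamped to an unsigned SUB_BITS-bit code.
    acc = sum(w * c for w, c in zip(weights, codes))
    return max(0, min(2 ** SUB_BITS - 1, acc))

def adder(sub_outputs):
    # Sum of sub-neuron codes, requantized to a BETA-bit output code.
    return min(2 ** BETA - 1, sum(sub_outputs) // 4)

def split(x):
    # Partition the F inputs into A contiguous groups of F // A inputs.
    k = F // A
    return [tuple(x[i * k:(i + 1) * k]) for i in range(A)]

sub_tables = [tabulate(lambda c, w=w: sub_neuron(c, w), F // A, BETA) for w in WEIGHTS]
adder_table = tabulate(adder, A, SUB_BITS)

def polylut_add_lookup(x):
    # Inference as table reads only: A sub-neuron lookups, then one adder lookup.
    subs = tuple(t[g] for t, g in zip(sub_tables, split(x)))
    return adder_table[subs]

def polylut_add_direct(x):
    # Reference evaluation without tables, for checking the lookup path.
    return adder(tuple(sub_neuron(g, w) for g, w in zip(split(x), WEIGHTS)))

stored = sum(len(t) for t in sub_tables) + len(adder_table)
print(entries_polylut(F, BETA), entries_polylut_add(F, BETA, A), stored)  # 4096 192 192
```

In this sketch the adder table's size depends only on $A$ and $\beta$, not on $F$, so for a fixed $A$ increasing $F$ grows the sub-neuron tables as $2^{(F/A)\beta}$ rather than $2^{F\beta}$.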