PolyLUT-Add: FPGA-based LUT Inference with Wide Inputs
Binglei Lou, Richard Rademacher, David Boland, Philip H. W. Leong
TL;DR
PolyLUT-Add addresses the scalability bottleneck of LUT-based networks on FPGAs by summing the outputs of $A$ PolyLUT sub-neurons with an adder, which increases neuron fan-in without the prohibitive LUT growth of a single wider table. The approach yields accuracy gains of up to $2.7\%$ at the cost of a $2$–$3\times$ increase in LUT size; compared at similar accuracy, it reduces LUT usage by $2.0\times$ to $13.9\times$ and latency by $1.2\times$ to $1.6\times$ on the MNIST, Jet Substructure, and UNSW-NB15 benchmarks. Training is performed offline with quantization-aware methods, and the hardware implementation uses a two-stage pipelining strategy to balance latency and throughput. Overall, PolyLUT-Add offers a practical path to high-accuracy, ultra-low-latency edge inference with LUT-based DNNs on FPGAs, and open-source tooling supports reproducibility.
Abstract
FPGAs have distinct advantages as a technology for deploying deep neural networks (DNNs) at the edge. Lookup Table (LUT) based networks, where neurons are directly modeled using LUTs, help maximize this promise of offering ultra-low latency and high area efficiency on FPGAs. Unfortunately, LUT resource usage scales exponentially with the number of inputs to the LUT, restricting PolyLUT to small LUT sizes. This work introduces PolyLUT-Add, a technique that enhances neuron connectivity by combining $A$ PolyLUT sub-neurons via addition to improve accuracy. Moreover, we describe a novel architecture to improve its scalability. We evaluated our implementation on the MNIST, Jet Substructure classification, and Network Intrusion Detection benchmarks and found that, for similar accuracy, PolyLUT-Add achieves a LUT reduction of $2.0$–$13.9\times$ with a $1.2$–$1.6\times$ decrease in latency.
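The exponential scaling mentioned in the abstract can be illustrated with a small counting sketch. It compares the truth-table entries of one wide table against $A$ narrower sub-neuron tables followed by a table for the adder stage. The parameter names (`beta` for input bit width, `fan_in`, `num_sub` for $A$, `sub_out_bits` for the sub-neuron output width) and the chosen values are assumptions made for illustration, not figures from the paper, and the counts are truth-table entries per output bit rather than physical FPGA LUTs.

```python
def lut_entries(input_bits: int) -> int:
    """Number of entries in a truth table addressed by `input_bits` bits."""
    return 2 ** input_bits


def monolithic_entries(beta: int, fan_in: int, num_sub: int) -> int:
    """One table that sees all num_sub * fan_in inputs, each beta bits wide."""
    return lut_entries(beta * fan_in * num_sub)


def additive_entries(beta: int, fan_in: int, num_sub: int, sub_out_bits: int) -> int:
    """num_sub sub-neuron tables plus one table for the adder stage.

    Assumption: the adder stage is itself modeled as a table addressed by
    the concatenated sub-neuron outputs (sub_out_bits bits each).
    """
    sub_tables = num_sub * lut_entries(beta * fan_in)
    adder_table = lut_entries(num_sub * sub_out_bits)
    return sub_tables + adder_table


if __name__ == "__main__":
    # Hypothetical configuration: 2-bit inputs, fan-in 4, A = 2, 3-bit sub-neuron outputs.
    wide = monolithic_entries(beta=2, fan_in=4, num_sub=2)
    split = additive_entries(beta=2, fan_in=4, num_sub=2, sub_out_bits=3)
    print(f"single wide table: {wide} entries")
    print(f"sub-neurons + adder: {split} entries")
```

Under these assumed values, both arrangements reach an effective fan-in of 8 inputs, but the split arrangement grows with $2^{\beta F}$ per sub-neuron instead of $2^{\beta F A}$ overall, which is the kind of saving that makes wider fan-in affordable.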
