NeuraLUT-Assemble: Hardware-aware Assembling of Sub-Neural Networks for Efficient LUT Inference
Marta Andronic, George A. Constantinides
TL;DR
NeuraLUT-Assemble tackles the accuracy bottleneck of LUT-based neural networks on FPGAs by assembling multiple small L-LUT neurons into tree-structured, hardware-aware sub-networks. The method combines mixed-precision training, learned input mappings, and skip-connections across LUT structures to increase connectivity without inflating the fan-in of any individual LUT. It includes a full toolflow, from quantization-aware training in PyTorch to RTL generation and FPGA synthesis, with configurable pipelining to optimize for either latency or throughput. Empirical results on MNIST, jet substructure, and network intrusion datasets show substantial area-delay product reductions while maintaining competitive accuracy, outperforming prior LUT-based approaches in efficiency and scalability. The work highlights the potential of hardware-aware AI designs for ultra-low-latency edge inference and provides an open-source framework for broader adoption.
Abstract
Efficient neural networks (NNs) leveraging lookup tables (LUTs) have demonstrated significant potential for emerging AI applications, particularly when deployed on field-programmable gate arrays (FPGAs) for edge computing. These architectures promise ultra-low latency and reduced resource utilization, broadening neural network adoption in fields such as particle physics. However, existing LUT-based designs suffer from accuracy degradation: neurons need a large fan-in, but LUT resources scale exponentially with input width, which caps the fan-in that is affordable. In prior work, this tension has in practice forced a reliance on extremely sparse models. We present NeuraLUT-Assemble, a novel framework that addresses these limitations by combining mixed-precision techniques with the assembly of larger neurons from smaller units, thereby increasing connectivity while keeping the number of inputs of any given LUT manageable. Additionally, we introduce skip-connections across entire LUT structures to improve gradient flow. NeuraLUT-Assemble closes the accuracy gap between LUT-based methods and (fully-connected) MLP-based models. It achieves competitive accuracy on tasks such as network intrusion detection, digit classification, and jet classification, and demonstrates up to an $8.42\times$ reduction in area-delay product compared to the state of the art at the time of publication.
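The fan-in argument above can be made concrete with a small back-of-the-envelope count. The sketch below is illustrative only and is not taken from the NeuraLUT-Assemble toolflow: the function names `lut_entries` and `tree_entries` are hypothetical, and it assumes that every signal in the tree uses the same bit width and that the total fan-in is a power of the per-LUT fan-in. It compares the truth-table size of one monolithic LUT with that of a tree assembled from smaller LUTs.

```python
def lut_entries(fan_in: int, bits_per_input: int) -> int:
    """Truth-table entries of one LUT with `fan_in` inputs of `bits_per_input` bits each."""
    return 2 ** (fan_in * bits_per_input)


def tree_entries(total_fan_in: int, lut_fan_in: int, bits_per_input: int) -> int:
    """Total truth-table entries when `total_fan_in` inputs are reduced by a tree of small LUTs.

    Simplifying assumptions (not from the paper): uniform bit width for inputs
    and intermediate signals, and `total_fan_in` is a power of `lut_fan_in`.
    """
    total = 0
    signals = total_fan_in
    while signals > 1:
        if signals % lut_fan_in != 0:
            raise ValueError("total_fan_in must be a power of lut_fan_in")
        luts_in_level = signals // lut_fan_in      # each LUT consumes lut_fan_in signals
        total += luts_in_level * lut_entries(lut_fan_in, bits_per_input)
        signals = luts_in_level                    # one output signal per LUT
    return total


if __name__ == "__main__":
    # 16 one-bit inputs: one monolithic LUT versus a two-level tree of 4-input LUTs.
    print(lut_entries(16, 1))      # 65536
    print(tree_entries(16, 4, 1))  # 80
```

Under these assumptions, the tree reaches the same overall fan-in with far fewer table entries. The trade-off is that a tree of small LUTs cannot represent every function a single large LUT can, which is why techniques for recovering accuracy matter in this setting.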
