PolyLUT: Learning Piecewise Polynomials for Ultra-Low Latency FPGA LUT-based Inference

Marta Andronic, George A. Constantinides

TL;DR

PolyLUT tackles ultra-low-latency DNN inference on FPGAs by learning a multivariate polynomial for each neuron function and embedding its evaluation inside LUTs. Each neuron's $F$ inputs are expanded into all monomials up to degree $D$ (giving $M=\binom{F+D}{D}$ terms), and each neuron is mapped to a logical LUT (L-LUT). This lets shallower networks reach the same accuracy, delivering sub-10 ns inference latency in practice. Experiments on UNSW-NB15, MNIST, and jet substructure tagging show substantial latency and LUT-count reductions compared with prior LUT-based methods while maintaining competitive accuracy. The work provides an open-source training framework and RTL-generation flow, enabling end-to-end DNN-to-LUT deployment on a single FPGA, with potential extensions via neural architecture search to further optimize depth, bit-width, and degree.
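To make the monomial count concrete, here is a minimal Python sketch that expands an input vector into all monomials of total degree at most $D$ and checks that the number of terms equals $\binom{F+D}{D}$, where $F$ is the number of inputs. The function name `monomial_expansion` and the ordering of the terms are assumptions made for illustration; they are not taken from the PolyLUT code base.

```python
from itertools import combinations_with_replacement
from math import comb, prod


def monomial_expansion(x, degree):
    """Return every monomial of the entries of x with total degree <= degree.

    Terms are ordered by degree, then lexicographically by input index.
    The degree-0 term is the constant 1 (an empty product).
    """
    terms = []
    for d in range(degree + 1):
        # Each multiset of d input indices corresponds to one monomial.
        for idx in combinations_with_replacement(range(len(x)), d):
            terms.append(prod(x[i] for i in idx))
    return terms


# Two inputs (F = 2), degree D = 2: 1, x0, x1, x0^2, x0*x1, x1^2
terms = monomial_expansion([2.0, 3.0], 2)
print(terms)                     # [1, 2.0, 3.0, 4.0, 6.0, 9.0]
print(len(terms), comb(2 + 2, 2))  # 6 6
```

With $D = 1$ the expansion reduces to the constant term plus the inputs themselves, which is the ordinary linear (affine) neuron.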

Abstract

Field-programmable gate arrays (FPGAs) are widely used to implement deep learning inference. Standard deep neural network inference involves the computation of interleaved linear maps and nonlinear activation functions. Prior work for ultra-low latency implementations has hardcoded the combination of linear maps and nonlinear activations inside FPGA lookup tables (LUTs). Our work is motivated by the idea that the LUTs in an FPGA can be used to implement a much greater variety of functions than this. In this paper, we propose a novel approach to training neural networks for FPGA deployment using multivariate polynomials as the basic building block. Our method takes advantage of the flexibility offered by the soft logic, hiding the polynomial evaluation inside the LUTs with minimal overhead. We show that by using polynomial building blocks, we can achieve the same accuracy using considerably fewer layers of soft logic than by using linear functions, leading to significant latency and area improvements. We demonstrate the effectiveness of this approach in three tasks: network intrusion detection, jet identification at the CERN Large Hadron Collider, and handwritten digit recognition using the MNIST dataset.
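The idea of hiding the polynomial evaluation inside a LUT can be sketched as a truth-table enumeration: every combination of quantized inputs is evaluated once, offline, and only the resulting table is deployed. The Python sketch below is illustrative only. It assumes unsigned inputs of `bits` bits each, a hypothetical weight vector, and a simple round-and-clamp output quantizer; the function name `neuron_truth_table` and these quantization choices are assumptions, not the authors' implementation.

```python
from itertools import combinations_with_replacement, product
from math import prod


def neuron_truth_table(weights, fan_in, bits, degree):
    """Tabulate a polynomial neuron over all quantized input combinations.

    Assumptions (illustrative): unsigned inputs in [0, 2**bits - 1] and an
    output obtained by rounding and clamping to the same range.
    """
    levels = range(2 ** bits)
    max_out = 2 ** bits - 1
    # Index tuples of all monomials up to `degree`; weights follow this order.
    monomials = [idx
                 for d in range(degree + 1)
                 for idx in combinations_with_replacement(range(fan_in), d)]
    assert len(weights) == len(monomials)

    table = {}
    for x in product(levels, repeat=fan_in):
        acc = sum(w * prod(x[i] for i in idx)
                  for w, idx in zip(weights, monomials))
        table[x] = min(max(round(acc), 0), max_out)
    return table


# Hypothetical neuron: 2 inputs, 2 bits each, degree 2, computing x0 * x1.
# Weight order: 1, x0, x1, x0^2, x0*x1, x1^2
table = neuron_truth_table([0, 0, 0, 0, 1, 0], fan_in=2, bits=2, degree=2)
print(len(table))      # 16 entries = 2 ** (bits * fan_in)
print(table[(1, 2)])   # 2
print(table[(3, 3)])   # 3 (9 clamped to the 2-bit maximum)
```

In this sketch the table size depends only on the fan-in and the input bit-width, not on the polynomial degree, because the polynomial is evaluated when the table is generated rather than at inference time.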

Paper Structure (27 sections, 2 equations, 8 figures, 4 tables)

Figures (8)

  • Figure 1: Structural use of LUTs for a single output channel in prior works (a,b) and our approach (c).
  • Figure 2: Visual representation of a 3-layer network.
  • Figure 3: Input transformations visualized as contour graphs at the output of each neuron. The black and red dots represent the training data points.
  • Figure 4: High-level view of PolyLUT's toolflow. We built upon the open-source LogicNets toolflow. We show in red the elements that were modified.
  • Figure 5: Training loss variation with the number of layers across six different polynomial degrees.
  • ...and 3 more figures