BitLogic: Training Framework for Gradient-Based FPGA-Native Neural Networks

Simon Bührer, Andreas Plesner, Aczel Till, Roger Wattenhofer

TL;DR

BitLogic is a gradient-based training framework for FPGA-native neural networks built around differentiable LUT nodes that map directly to FPGA LUT primitives, enabling end-to-end training and native FPGA deployment. It replaces multiply-accumulate (MAC) operations with learnable LUTs, supports learned encoders and hardware-aware heads, and provides an automated RTL export pipeline that keeps software and hardware inference equivalent. The paper surveys LUT-based and differentiable architectures, introduces boundary-consistent relaxations for training, and reports competitive accuracy with substantial FPGA efficiency gains on standard vision benchmarks, including 72.3% test accuracy on CIFAR-10 with fewer than 0.3M logic gates. It also reports sub-20 ns single-sample FPGA inference using only LUT resources, which supports the practicality of LUT-based deployment for edge and datacenter workloads.

Abstract

The energy and latency costs of deep neural network inference are increasingly driven by deployment rather than training, motivating hardware-specialized alternatives to arithmetic-heavy models. Field-Programmable Gate Arrays (FPGAs) provide an attractive substrate for such specialization, yet existing FPGA-based neural approaches are fragmented and difficult to compare. We present BitLogic, a fully gradient-based, end-to-end trainable framework for FPGA-native neural networks built around Lookup Table (LUT) computation. BitLogic replaces multiply-accumulate operations with differentiable LUT nodes that map directly to FPGA primitives, enabling native binary computation, sparse connectivity, and efficient hardware realization. The framework offers a modular functional API supporting diverse architectures, along with learned encoders, hardware-aware heads, and multiple boundary-consistent LUT relaxations. An automated Register Transfer Level (RTL) export pipeline translates trained PyTorch models into synthesizable HDL, ensuring equivalence between software and hardware inference. Experiments across standard vision benchmarks and heterogeneous hardware platforms demonstrate competitive accuracy and substantial gains in FPGA efficiency, including 72.3% test accuracy on CIFAR-10 achieved with fewer than 0.3M logic gates, while attaining sub-20 ns single-sample inference using only LUT resources.
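
To make the idea of a differentiable LUT node concrete, the following is a minimal sketch of one possible relaxation: multilinear interpolation of a truth table. It is an illustrative assumption, not necessarily one of the relaxations BitLogic implements; the function names `lut_discrete` and `lut_relaxed` are hypothetical. The sketch is "boundary-consistent" in the sense that it returns exactly the table entry whenever every input is 0 or 1.

```python
from itertools import product


def lut_discrete(table, bits):
    """Return the truth-table entry addressed by bits (bits[0] is the MSB)."""
    index = 0
    for b in bits:
        index = (index << 1) | b
    return table[index]


def lut_relaxed(table, x):
    """Multilinear interpolation of the table for inputs x in [0, 1]^n.

    Each corner of the hypercube contributes its table entry, weighted by
    the product of x_i (where the corner bit is 1) or 1 - x_i (where it is 0).
    The result is linear in every x_i and in every table entry, so gradients
    with respect to both are well defined.
    """
    y = 0.0
    for corner in product((0, 1), repeat=len(x)):
        weight = 1.0
        for xi, ci in zip(x, corner):
            weight *= xi if ci else (1.0 - xi)
        y += weight * lut_discrete(table, corner)
    return y


# Hypothetical 2-input node whose table encodes XOR.
xor_table = [0.0, 1.0, 1.0, 0.0]
```

At the four binary corners `lut_relaxed(xor_table, ...)` agrees with `lut_discrete`, while an intermediate input such as `[0.5, 0.5]` yields a blend of all four entries. In a trainable setting the table entries would themselves be learnable values rather than fixed constants.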

Paper Structure

This paper contains 66 sections, 39 equations, 4 figures, 8 tables.

Figures (4)

  • Figure 1: LUT representations. Left: discrete lookup table mapping $\mathbf{x} \in \{0,1\}^n$ to $y \in \{0,1\}$. Right: continuous relaxation with $\mathbf{x} \in [0,1]^n$ and $y \in [0,1]$ for gradient-based training.
  • Figure 2: Declarative MNIST CNN: Thermometer encoder (N: 8) feeds two TopK-Sparse convolutional layers (k: 8, Hybrid nodes with input dimension 6) that reduce spatial dimensions via stride-2 convolutions. Features are flattened and processed by a TopK-Sparse lookup layer (input: 4, k: 8), then aggregated by a GroupSum head into 10 class predictions. Tensor shapes annotated on edges.
  • Figure 3: End-to-end FPGA deployment methodology. The pipeline transforms a trained PyTorch model into deployable hardware through automated RTL generation, synthesis and implementation, verification (including DRC), and bitstream generation.
  • Figure :
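
The two end points of the Figure 2 pipeline, the thermometer encoder and the GroupSum head, can be sketched in a few lines. This is a simplified illustration under stated assumptions: the thresholds are evenly spaced (the paper also mentions learned encoders), GroupSum is taken to mean counting active bits in equal contiguous groups, and the function names `thermometer_encode` and `group_sum` are hypothetical rather than BitLogic's API.

```python
def thermometer_encode(value, num_bits=8):
    """Unary (thermometer) code for a value in [0, 1].

    Bit k is 1 when the value exceeds the k-th threshold. Thresholds are
    assumed to be evenly spaced at (k + 1) / (num_bits + 1).
    """
    thresholds = [(k + 1) / (num_bits + 1) for k in range(num_bits)]
    return [1 if value > t else 0 for t in thresholds]


def group_sum(bits, num_classes):
    """Split bits into equal contiguous groups and count the ones in each.

    The per-group counts act as class scores; the predicted class is the
    index of the largest count.
    """
    if len(bits) % num_classes != 0:
        raise ValueError("bit count must be divisible by num_classes")
    size = len(bits) // num_classes
    return [sum(bits[c * size:(c + 1) * size]) for c in range(num_classes)]


# A mid-gray pixel becomes a run of ones followed by zeros.
pixel_bits = thermometer_encode(0.5, num_bits=8)

# Six output bits aggregated into three class scores.
scores = group_sum([1, 0, 1, 1, 0, 0], num_classes=3)
```

Both operations reduce to comparisons and bit counts, which is one reason encoders and heads of this kind suit LUT-only hardware.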