smallNet: Implementation of a convolutional layer in tiny FPGAs

Fernanda Zapata Bascuñán, Alan Ezequiel Fuster

TL;DR

Deploying CNNs on resource-limited embedded hardware is challenging due to compute and memory demands. The authors implement smallNet, a hand-coded Verilog CNN with two 2-by-2 convolutional layers, pooling, and a dense output, validated on a Zynq-7000 board with AXI-DMA integration. Hardware inference achieves about 109 ms per pass with a roughly 5.1× speedup over software and hardware accuracy around 81% on MNIST, while consuming about 1.5 W. This work demonstrates a cost-effective, energy-efficient path for embedded AI on low-cost FPGAs and provides a foundation for future ASIC adoption in constrained environments.

Abstract

Because current neural network development systems for Xilinx and VLSI require co-development with Python libraries, we implemented the first stage of a convolutional network, a convolutional layer, entirely in Verilog. This hand-coded design, free of IP cores and based on a filter-polynomial-like structure, enables straightforward deployment not only on low-cost FPGAs but also on SoMs, SoCs, and ASICs. We analyze the limitations of numerical representations and compare our implemented architecture, smallNet, with its computer-based counterpart, demonstrating a 5.1× speedup, over 81% classification accuracy, and a total power consumption of just 1.5 W. The algorithm is validated on a single-core Cora Z7, demonstrating its feasibility for real-time, resource-constrained embedded applications.

Paper Structure

This paper contains 12 sections, 5 figures.

Figures (5)

  • Figure 1: A CNN architecture that adds convolutional layers and pooling layers before dense layers [author2020].
  • Figure 2: Architecture of smallNet, a lightweight convolutional neural network with 550 trainable parameters.
  • Figure 3: Validation of loss and accuracy during the training of smallNet in the Keras environment.
  • Figure 4: Hardware implementation of the convolutional neuron in Verilog. The design includes a windowing module, parallel multiply-accumulate (MAC) units with bias addition, followed by an activation function. The control logic FSM manages data flow and synchronization across the pipeline, including the application of the windowing operation.
  • Figure 5: System-level architecture integrating smallNet, a lightweight convolutional neural network deployed on a Zynq SoC. Data is streamed through a FIFO buffer and processed by the neural network, with GPIOs and interrupts coordinating control between programmable logic and the processing system. Results are sent via UART.
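
The data path described in the Figure 4 caption (windowing, multiply-accumulate with bias, activation) can be illustrated with a small software model. This is only a sketch under stated assumptions, not the authors' Verilog: the function names are hypothetical, the stride of 1 and absence of padding are assumed, ReLU stands in for the activation function (which the caption does not name), and plain integers stand in for whatever fixed-point format the hardware uses. The loop here is sequential, whereas the caption describes parallel MAC units coordinated by an FSM.

```python
from typing import Iterator, List, Sequence, Tuple

Image = List[List[int]]


def windows_2x2(image: Image) -> Iterator[Tuple[int, int, Tuple[int, int, int, int]]]:
    """Slide a 2x2 window over the image (assumed stride 1, no padding)."""
    rows, cols = len(image), len(image[0])
    for r in range(rows - 1):
        for c in range(cols - 1):
            window = (image[r][c], image[r][c + 1],
                      image[r + 1][c], image[r + 1][c + 1])
            yield r, c, window


def mac(window: Sequence[int], weights: Sequence[int], bias: int) -> int:
    """Multiply-accumulate: bias plus the sum of pixel * weight products."""
    acc = bias
    for pixel, weight in zip(window, weights):
        acc += pixel * weight
    return acc


def relu(value: int) -> int:
    """Assumed activation function; the source does not specify which one is used."""
    return value if value > 0 else 0


def conv_neuron(image: Image, weights: Sequence[int], bias: int) -> Image:
    """Apply one 2x2 convolutional neuron to every window of the image."""
    rows, cols = len(image), len(image[0])
    out = [[0] * (cols - 1) for _ in range(rows - 1)]
    for r, c, window in windows_2x2(image):
        out[r][c] = relu(mac(window, weights, bias))
    return out


if __name__ == "__main__":
    demo = [[1, 2, 3],
            [4, 5, 6],
            [7, 8, 9]]
    # Four unit weights and a bias of -13 (illustrative values only).
    print(conv_neuron(demo, weights=(1, 1, 1, 1), bias=-13))
```

With the illustrative 3x3 input above, each output is the window sum minus 13, clipped at zero, so the 2x2 result is `[[0, 3], [11, 15]]`. Keeping the arithmetic in integers makes it easier to compare such a model bit-for-bit against a hardware simulation, although the actual word widths would have to come from the design itself.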