smallNet: Implementation of a convolutional layer in tiny FPGAs
Fernanda Zapata Bascuñán, Alan Ezequiel Fuster
TL;DR
Deploying CNNs on resource-limited embedded hardware is challenging because of their compute and memory demands. The authors implement smallNet, a hand-coded Verilog CNN with two 2-by-2 convolutional layers, pooling, and a dense output, and validate it on a Zynq-7000 board with AXI-DMA integration. Hardware inference takes about 109 ms per pass, a roughly 5.1× speedup over software, and reaches around 81% accuracy on MNIST while consuming about 1.5 W. The work demonstrates a cost-effective, energy-efficient path for embedded AI on low-cost FPGAs and provides a foundation for future ASIC adoption in constrained environments.
Abstract
Because current neural network development flows for Xilinx devices and VLSI require co-development with Python libraries, we implement the first stage of a convolutional network as a convolutional layer written entirely in Verilog. This hand-coded design, free of IP cores and based on a filter-polynomial-like structure, enables straightforward deployment not only on low-cost FPGAs but also on SoMs, SoCs, and ASICs. We analyze the limitations of numerical representations and compare our implemented architecture, smallNet, with its computer-based counterpart, demonstrating a 5.1× speedup, over 81% classification accuracy, and a total power consumption of just 1.5 W. The algorithm is validated on a single-core Cora Z7, demonstrating its feasibility for real-time, resource-constrained embedded applications.
