HAPM -- Hardware Aware Pruning Method for CNN hardware accelerators in resource constrained devices

Federico Nicolas Peccia; Luciano Ferreyro; Alejandro Furfaro

TL;DR

This work tackles efficient CNN inference on resource-constrained FPGA devices by introducing a compact hardware accelerator built from small, reusable systolic arrays and a scheduling-driven dataflow that exploits the sparsity produced by pruning. The key contribution is the Hardware Aware Pruning Method (HAPM), which prunes weight groups aligned to the hardware's parallel execution pattern and retrains the network to preserve accuracy; a Dynamic Sparsity Bypass (DSB) further accelerates inference by bypassing zero-valued computations. The authors validate the design on Zybo and Zedboard boards with a ResNet-like network trained on CIFAR-10, demonstrating up to 45% faster inference per image with HAPM compared to standard pruning, and they compare theoretical performance modeling against measured results. The work highlights the importance of hardware-aware sparsity and dataflow scheduling for achieving practical speedups on low-resource FPGAs, and points to future directions in energy analysis, memory compression, and scheduling optimizations. Overall, HAPM provides a concrete way to reconcile pruning-driven model compression with the architectural realities of FPGA accelerators for embedded vision tasks.

Abstract

In recent years, algorithms known as Convolutional Neural Networks (CNNs) have become increasingly popular, expanding their application range to several areas. In particular, the image processing field has advanced remarkably thanks to these algorithms. In IoT, a broad research field aims to develop hardware capable of executing them at the lowest possible energy cost while keeping image inference time acceptable. These apparently conflicting objectives can be reconciled by applying design and training techniques. The present work proposes a generic hardware architecture ready to be implemented on FPGA devices. It supports a wide range of configurations, which allows the system to run different neural network architectures, and it dynamically exploits the sparsity that pruning techniques introduce into the mathematical operations of this kind of algorithm. The inference speed of the design is evaluated on different resource-constrained FPGA devices. Finally, the standard pruning algorithm is compared against a custom pruning technique specifically designed to exploit the scheduling properties of this hardware accelerator. We demonstrate that our hardware-aware pruning algorithm achieves a 45% improvement in inference time compared to a network pruned using the standard algorithm.
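As a rough illustration of what pruning weight groups (rather than individual weights) can look like, the sketch below zeroes whole groups of consecutive kernels so that a scheduler could skip them entirely. It is not the authors' implementation: the function name `prune_kernel_groups`, the `group_size` parameter (a hypothetical stand-in for the number of kernels the accelerator processes in parallel), and the L1-norm ranking criterion are all assumptions made for this example, and the retraining step is omitted.

```python
def prune_kernel_groups(kernels, group_size, prune_fraction):
    """Zero the groups of consecutive kernels with the smallest L1 norm.

    kernels: list of kernels, each flattened to a list of floats.
    group_size: assumed number of kernels the hardware schedules together.
    prune_fraction: fraction of groups to zero out, in [0, 1].
    Returns (pruned_kernels, keep_mask), with one mask entry per group.
    """
    if len(kernels) % group_size != 0:
        raise ValueError("number of kernels must be a multiple of group_size")
    n_groups = len(kernels) // group_size

    # L1 norm of every group (assumed ranking criterion).
    norms = []
    for g in range(n_groups):
        group = kernels[g * group_size:(g + 1) * group_size]
        norms.append(sum(abs(w) for kernel in group for w in kernel))

    # Select the groups with the lowest norm for removal.
    n_prune = int(prune_fraction * n_groups)
    order = sorted(range(n_groups), key=lambda g: norms[g])
    to_prune = set(order[:n_prune])
    keep_mask = [g not in to_prune for g in range(n_groups)]

    # Zero every kernel that belongs to a pruned group; copy the rest.
    pruned = []
    for i, kernel in enumerate(kernels):
        if keep_mask[i // group_size]:
            pruned.append(list(kernel))
        else:
            pruned.append([0.0] * len(kernel))
    return pruned, keep_mask


kernels = [
    [1.0, -1.0], [0.5, 0.5],   # group 0
    [3.0, 3.0], [2.0, -2.0],   # group 1
    [0.1, 0.0], [0.0, 0.1],    # group 2
    [4.0, 4.0], [5.0, 5.0],    # group 3
]
pruned, keep_mask = prune_kernel_groups(kernels, group_size=2, prune_fraction=0.5)
```

Because zeros arrive as complete groups, a group-level mask such as `keep_mask` is enough to decide which scheduled steps can be skipped, whereas unstructured pruning scatters zeros across groups that still have to be processed.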

Paper Structure (17 sections, 7 figures, 2 tables, 3 algorithms)

Introduction
Design of the hardware architecture
Optimization of the convolution operation
Core processing element
Computation units matrix
Matrix block
Convolution scheduling
Block diagram of the entire architecture
Hardware Aware Pruning Method
Materials and methods
Validation
Training
Theoretical accelerator performance
Measurements
Discussion
...and 2 more sections

Figures (7)

Figure 1: (a) Simplified block diagram of a PE used in the design; (b) scheduling of a single $3\times3$ kernel convolution on a computation units matrix of $CU_x = 2$ and $CU_y = 3$; (c) $N_{CU}$ computation units matrices sharing data, kernel and partial sum buses.
Figure 2: General block diagram of the proposed design
Figure 3: Training and validation accuracy curves of the trained models
Figure 4: Sparsity per layer at the end of the training for models 3 and 4. Notice how our method chooses to almost suppress some layers, while keeping others practically intact.
Figure 5: Theoretical performance of different configurations of the hardware accelerator for the chosen CNN, calculated at 100 MHz. Each plot represents one $CU_x$ parameter configuration.
...and 2 more figures
