FPGA Resource-aware Structured Pruning for Real-Time Neural Networks

Benjamin Ramhorst; Vladimir Loncar; George A. Constantinides

TL;DR

This paper addresses the hardware inefficiency of unstructured pruning in FPGA-based real-time neural network inference. It proposes a hardware-aware structured pruning framework that groups weights according to their DSP and BRAM usage, as determined by the reuse factor ($RF$), and formulates pruning as a knapsack problem solved via branch-and-cut. Training includes a resource-aware regularizer, and pruning proceeds iteratively to meet hardware budgets. The method integrates with hls4ml and reduces DSP utilization by 55% to 92% and BRAM utilization by up to 81%, while maintaining competitive accuracy on tasks including particle classification at CERN's LHC and standard vision datasets. These savings make real-time neural network inference on FPGAs more feasible in high-throughput domains.

Abstract

Neural networks achieve state-of-the-art performance in image classification, speech recognition, scientific analysis and many more application areas. Due to the high computational complexity and memory footprint of neural networks, various compression techniques, such as pruning and quantization, have been proposed in the literature. Pruning sparsifies a neural network, reducing the number of multiplications and the memory footprint. However, pruning often fails to capture properties of the underlying hardware, causing unstructured sparsity and load-balance inefficiency, which limits the achievable resource savings. We propose a hardware-centric approach that formulates pruning as a knapsack problem with resource-aware tensor structures. Evaluated on a range of tasks, including sub-microsecond particle classification at CERN's Large Hadron Collider and fast image classification, the proposed method achieves reductions of between 55% and 92% in DSP utilization and up to 81% in BRAM utilization.
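To make the knapsack view of pruning concrete, the sketch below selects which weight groups to keep so that their total importance is maximized under a single resource budget. It is only an illustration under stated assumptions: the function name `select_groups`, the per-group importance scores, and the integer per-group costs are hypothetical, a single resource stands in for the DSP and BRAM budgets, and a small dynamic program replaces the branch-and-cut solver mentioned in the summary.

```python
def select_groups(values, costs, budget):
    """0/1 knapsack by dynamic programming.

    values: hypothetical importance score per weight group (e.g. a magnitude sum)
    costs:  hypothetical integer resource cost per group (e.g. DSP blocks)
    budget: integer resource budget
    Returns the sorted indices of the groups to keep; all others are pruned.
    """
    n = len(values)
    # best[i][b] = highest total value using the first i groups within budget b
    best = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        v, c = values[i - 1], costs[i - 1]
        for b in range(budget + 1):
            best[i][b] = best[i - 1][b]  # option 1: prune group i-1
            if c <= b and best[i - 1][b - c] + v > best[i][b]:
                best[i][b] = best[i - 1][b - c] + v  # option 2: keep it
    # Walk back through the table to recover which groups were kept.
    keep, b = [], budget
    for i in range(n, 0, -1):
        if best[i][b] != best[i - 1][b]:
            keep.append(i - 1)
            b -= costs[i - 1]
    return sorted(keep)


# Four groups, budget of 3 resource units (illustrative numbers only).
kept = select_groups([3.0, 1.0, 4.0, 2.0], [1, 1, 2, 1], 3)
print(kept)  # [0, 2]
```

In this toy instance, groups 0 and 2 together use exactly the budget and carry the highest combined score, so groups 1 and 3 would be pruned.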

Paper Structure (3 sections, 3 equations, 1 figure, 1 table)

Introduction
Resource-aware pruning
Results

Figures (1)

Figure 1: Variations of RF and the impact on resource utilization [Duarte_hls4ml].
