HGQ: High Granularity Quantization for Real-time Neural Networks on FPGAs
Chang Sun, Zhiqiang Que, Thea K. Årrestad, Vladimir Loncar, Jennifer Ngadiuba, Wayne Luk, Maria Spiropulu
TL;DR
HGQ tackles sub-microsecond real-time DNN inference on FPGAs by enabling per-parameter heterogeneous bit-widths learned through differentiable fixed-point quantization, coupled with a differentiable on-chip resource estimator (EBOPs) that regularizes hardware cost. The framework preserves standard neural-network architectures and supports end-to-end FPGA deployment via da4ml/hls4ml, automating accuracy-resource trade-offs and pruning through zero-bit-width parameters. Across JSC, SVHN, and muon-tracking tasks, HGQ delivers substantial resource and latency reductions while maintaining or improving accuracy relative to prior quantization and LUT-based approaches, especially for larger models. The open-source HGQ platform aims to enable real-time triggers in CERN experiments and other latency-critical domains by providing scalable, hardware-aware QAT for FPGAs.
Abstract
Neural networks with sub-microsecond inference latency are required by many critical applications. Targeting such applications deployed on FPGAs, we present High Granularity Quantization (HGQ), a quantization-aware training framework that optimizes parameter bit-widths through gradient descent. Unlike conventional methods, HGQ determines the optimal bit-width for each parameter independently, making it suitable for hardware platforms supporting heterogeneous arbitrary precision arithmetic. In our experiments, HGQ shows superior performance compared to existing network compression methods, achieving orders of magnitude reduction in resource consumption and latency while maintaining the accuracy on several benchmark tasks. These improvements enable the deployment of complex models previously infeasible due to resource or latency constraints. HGQ is open-source and is used for developing next-generation trigger systems at the CERN ATLAS and CMS experiments for particle physics, enabling the use of advanced machine learning models for real-time data selection with sub-microsecond latency.
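To make the idea of per-parameter bit-widths concrete, the following is a minimal NumPy sketch of the forward pass only; it is not the HGQ implementation. The function names `quantize_fixed_point` and `ebops_like_cost` are hypothetical. The signed fixed-point convention, the round-half-up rule, and the cost formula (sum of activation bit-width times weight bit-width over all multiplications) are assumptions chosen for illustration, and the exact EBOPs definition in the paper may differ. The gradient-based learning of bit-widths is not shown.

```python
import numpy as np

def quantize_fixed_point(x, frac_bits, total_bits):
    """Quantize each element of x to its own signed fixed-point format.

    frac_bits[i]  : number of fractional bits for element i (assumed convention)
    total_bits[i] : total bit-width including sign; 0 means the element is pruned
    """
    x = np.asarray(x, dtype=np.float64)
    frac_bits = np.asarray(frac_bits)
    total_bits = np.asarray(total_bits)

    step = 2.0 ** (-frac_bits)                  # quantization step per element
    hi = (2.0 ** (total_bits - 1) - 1) * step   # largest representable value
    lo = -(2.0 ** (total_bits - 1)) * step      # smallest representable value

    # Round half up onto the grid, then saturate to the representable range.
    q = np.clip(np.floor(x / step + 0.5) * step, lo, hi)

    # A zero bit-width parameter carries no information: treat it as pruned.
    return np.where(total_bits > 0, q, 0.0)

def ebops_like_cost(act_bits, weight_bits):
    """Illustrative resource proxy for a dense layer (assumed formula).

    act_bits    : shape (n_in,)        bit-width of each input activation
    weight_bits : shape (n_in, n_out)  bit-width of each weight
    Each multiplication is charged act_bits[i] * weight_bits[i, j];
    pruned weights (0 bits) cost nothing.
    """
    act_bits = np.asarray(act_bits, dtype=np.float64)
    weight_bits = np.asarray(weight_bits, dtype=np.float64)
    return float(np.sum(act_bits[:, None] * weight_bits))

# Three weights, each with its own format; the last one is pruned.
q = quantize_fixed_point([0.3, -1.26, 0.8], frac_bits=[2, 1, 0], total_bits=[4, 3, 0])
print(q.tolist())  # [0.25, -1.5, 0.0]

# 2x2 dense layer: 8*4 + 8*0 + 6*3 + 6*2 = 62
print(ebops_like_cost([8, 6], [[4, 0], [3, 2]]))  # 62.0
```

In a training setting, a cost of this kind would be added to the loss with a weighting factor so that gradient descent trades accuracy against estimated hardware cost; that weighting and the surrogate gradients for the bit-widths are outside the scope of this sketch.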
