HGQ: High Granularity Quantization for Real-time Neural Networks on FPGAs

Chang Sun, Zhiqiang Que, Thea K. Årrestad, Vladimir Loncar, Jennifer Ngadiuba, Wayne Luk, Maria Spiropulu

TL;DR

HGQ tackles sub-microsecond real-time DNN inference on FPGAs by enabling per-parameter heterogeneous bit-widths learned through differentiable fixed-point quantization, coupled with a differentiable on-chip resource estimator (EBOPs) that regularizes hardware cost. The framework preserves standard neural-network architectures and supports end-to-end FPGA deployment via da4ml/hls4ml, automating accuracy-resource trade-offs and pruning through zero-bit-width parameters. Across JSC, SVHN, and muon-tracking tasks, HGQ delivers substantial resource and latency reductions while maintaining or improving accuracy relative to prior quantization and LUT-based approaches, especially for larger models. The open-source HGQ platform aims to enable real-time triggers in CERN experiments and other latency-critical domains by providing scalable, hardware-aware QAT for FPGAs.
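
To make the idea of per-parameter bit-widths concrete, the following is a minimal sketch of the forward pass only, not the HGQ implementation. It rounds each weight to a signed fixed-point grid with its own total and fractional bit-width, and treats a total bit-width of zero as pruning. The function names, the round-half-up rule, and the saturating overflow behaviour are assumptions made for illustration. How HGQ learns the bit-widths by gradient descent is not shown.

```python
import math


def quantize_fixed(x: float, frac_bits: int, total_bits: int) -> float:
    """Round x to a signed fixed-point value (illustrative helper, hypothetical name).

    total_bits counts the sign bit; frac_bits of them sit after the binary point.
    A total bit-width of zero removes the parameter entirely (pruning).
    """
    if total_bits <= 0:
        return 0.0
    step = 2.0 ** -frac_bits                    # spacing of the fixed-point grid
    lo = -(2 ** (total_bits - 1)) * step        # most negative representable value
    hi = (2 ** (total_bits - 1) - 1) * step     # most positive representable value
    q = math.floor(x / step + 0.5) * step       # round half up (assumed rounding mode)
    return min(max(q, lo), hi)                  # saturate on overflow (assumed)


def quantize_per_parameter(weights, frac_bits, total_bits):
    """Apply an independent bit-width to every parameter."""
    return [quantize_fixed(w, f, b) for w, f, b in zip(weights, frac_bits, total_bits)]


weights = [0.8125, -0.3, 0.07, 1.9]
total_bits = [6, 3, 0, 4]   # the third weight gets zero bits and is pruned
frac_bits = [4, 2, 0, 2]
quantized = quantize_per_parameter(weights, frac_bits, total_bits)
print(quantized)  # [0.8125, -0.25, 0.0, 1.75]
```

In this example the first weight is already on its grid, the second is rounded to the nearest multiple of 0.25, the third is pruned, and the fourth saturates at the largest value that 4 bits with 2 fractional bits can hold.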

Abstract

Neural networks with sub-microsecond inference latency are required by many critical applications. Targeting such applications deployed on FPGAs, we present High Granularity Quantization (HGQ), a quantization-aware training framework that optimizes parameter bit-widths through gradient descent. Unlike conventional methods, HGQ determines the optimal bit-width for each parameter independently, making it suitable for hardware platforms supporting heterogeneous arbitrary precision arithmetic. In our experiments, HGQ shows superior performance compared to existing network compression methods, achieving orders of magnitude reduction in resource consumption and latency while maintaining the accuracy on several benchmark tasks. These improvements enable the deployment of complex models previously infeasible due to resource or latency constraints. HGQ is open-source and is used for developing next-generation trigger systems at the CERN ATLAS and CMS experiments for particle physics, enabling the use of advanced machine learning models for real-time data selection with sub-microsecond latency.

Paper Structure

This paper contains 20 sections, 12 equations, 6 figures, 1 table.

Figures (6)

  • Figure 1: An illustration of the HGQ quantization scheme on weights and activations. In this example, each weight and activation has its own learnable bit-width. Parameters with zero bit-width are effectively pruned.
  • Figure 2: Overall workflow of the HGQ framework. The dark blue flow uses the da4ml backend, and the brown flow uses the hls4ml backend with optional DA optimizations.
  • Figure 3: The relationship between EBOPs and the post place-and-route resource consumption with (top left) and without (top right) DA optimization, and the distribution of error between the estimated and actual LUT consumption after place-and-route with DA (lower). For the error distribution, $\mathrm{LUT}_\mathrm{pred}$ is given by $\exp(0.985\cdot\log(\mathrm{EBOPs}))$.
  • Figure 4: Distributions of the weight and activation bit-widths of the HGQ models for the HLF JSC task, CERNBox version.
  • Figure 5: Distributions of the weight and activation bit-widths of the HGQ models for the muon tracking task.
  • ...and 1 more figure
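
The Figure 3 caption gives the fitted relation $\mathrm{LUT}_\mathrm{pred} = \exp(0.985\cdot\log(\mathrm{EBOPs}))$ for the DA-optimized case. A short sketch of how such an estimate could be evaluated is shown below. The EBOPs computation here is an assumed simplification for a single dense layer (the sum of weight bit-width times input activation bit-width over all multiplications), and the function names and example bit-widths are hypothetical. Only the exponent 0.985 comes from the caption.

```python
import math


def dense_ebops(weight_bits, activation_bits):
    """Assumed simplified EBOPs for one dense layer.

    weight_bits[i][j] is the bit-width of the weight connecting input i to
    output j; activation_bits[i] is the bit-width of input i. Each
    multiplication contributes the product of its two operand bit-widths,
    so zero-bit (pruned) weights contribute nothing.
    """
    return sum(a * w for a, row in zip(activation_bits, weight_bits) for w in row)


def predicted_luts(ebops_value):
    """LUT estimate from the Figure 3 caption: exp(0.985 * log(EBOPs))."""
    if ebops_value <= 0:
        return 0.0  # convention of this sketch: a fully pruned layer uses no LUTs
    return math.exp(0.985 * math.log(ebops_value))


activation_bits = [8, 6]                 # two inputs
weight_bits = [[4, 0, 3], [2, 5, 0]]     # 2 inputs x 3 outputs, two weights pruned
total = dense_ebops(weight_bits, activation_bits)   # 8*(4+0+3) + 6*(2+5+0) = 98
estimate = predicted_luts(total)
print(total, round(estimate, 1))
```

Because the exponent is slightly below one, the estimate is close to, and slightly below, the EBOPs value itself, which is why EBOPs can act as a proxy for LUT usage during training.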