hls4ml: An Open-Source Codesign Workflow to Empower Scientific Low-Power Machine Learning Devices

Farah Fahim, Benjamin Hawks, Christian Herwig, James Hirschauer, Sergo Jindariani, Nhan Tran, Luca P. Carloni, Giuseppe Di Guglielmo, Philip Harris, Jeffrey Krupa, Dylan Rankin, Manuel Blanco Valentin, Josiah Hester, Yingyi Luo, John Mamish, Seda Orgrenci-Memik, Thea Aarrestad, Hamza Javed, Vladimir Loncar, Maurizio Pierini, Adrian Alan Pol, Sioni Summers, Javier Duarte, Scott Hauck, Shih-Chieh Hsu, Jennifer Ngadiuba, Mia Liu, Duc Hoang, Edward Kreinar, Zhenbin Wu

TL;DR

The paper addresses the need for energy-efficient, edge-enabled ML in science by delivering an open-source codesign workflow (hls4ml) that translates trained neural networks into hardware implementations for FPGA and ASIC. It combines a Python-based workflow with quantization-aware training and pruning, plus end-to-end FPGA and ASIC backends via multiple HLS toolchains, to enable low-power, real-time inference near sensors. Key contributions include quantization-aware pruning, QKeras frontend integration, and device-specific workflows that span Xilinx FPGA and ASIC targets, significantly accelerating hardware-aware ML design for scientific applications. The framework emphasizes introspection, validation, and design-space exploration to empower domain scientists to rapidly iterate and deploy efficient ML accelerators.

Abstract

Accessible machine learning algorithms, software, and diagnostic tools for energy-efficient devices and systems are extremely valuable across a broad range of application domains. In scientific domains, real-time near-sensor processing can drastically improve experimental design and accelerate scientific discoveries. To support domain scientists, we have developed hls4ml, an open-source software-hardware codesign workflow to interpret and translate machine learning algorithms for implementation with both FPGA and ASIC technologies. We expand on previous hls4ml work by extending capabilities and techniques towards low-power implementations and increased usability: new Python APIs, quantization-aware pruning, end-to-end FPGA workflows, long pipeline kernels for low power, and new device backends, including an ASIC workflow. Taken together, these and continued efforts in hls4ml will arm a new generation of domain scientists with accessible, efficient, and powerful tools for machine-learning-accelerated discovery.

Paper Structure

This paper contains 16 sections, 1 equation, 8 figures, 1 table.

Figures (8)

  • Figure 1: A typical workflow to translate an ML model into an FPGA or ASIC implementation using hls4ml. The red boxes (left) describe the model training and compression steps performed within conventional ML software frameworks. The hls4ml configuration and conversion steps are shown in the blue boxes (center). The black boxes (right) illustrate possible ways to export and integrate the HLS project into a larger hardware design.
  • Figure 2: Internal structure of the hls4ml package. Model converters translate models from Keras, PyTorch, etc. into an intermediate HLSModel representation. This representation can be further configured and optimized. Different backend writers can be used to export the model into a given vendor-specific language, such as Vitis HLS, Quartus HLS, Catapult HLS, or others.
  • Figure 3: Numerical profiling graph (top) from hls4ml for a fully-connected neural network (bottom). The distribution of the absolute value of the weights is shown on the x-axis. The items on the y-axis are the different weights (0) and biases (1) for the model layers.
  • Figure 4: Performance of quantization-aware training from Ref. [Coelho:2020zfu] in terms of the relative accuracy as a function of bit width. The relative accuracy is evaluated with respect to the floating-point baseline model. The CPU-based emulation (solid green) of the FPGA-based QAT model (solid orange) is compared to the PTQ model (dashed purple).
  • Figure 5: Performance of quantization-aware pruning using the lottery ticket pruning scheme as a function of hardware computational complexity. After QAP, the 6-bit, 80% pruned model achieves a factor of 50 reduction in BOPs compared to the 32-bit, unpruned model with no loss in performance.
  • ...and 3 more figures
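
The Figure 5 caption compares models by bit operations (BOPs). As a rough illustration of how bit width and pruning enter such a count, the sketch below uses a commonly quoted approximation for a fully connected layer with n inputs and m outputs: BOPs ≈ m·n·[(1 − f_p)·b_a·b_w + b_a + b_w + log2(n)], where f_p is the pruned fraction and b_a, b_w are the activation and weight bit widths. The formula is taken as an assumption here rather than from the text above, and the layer widths in `LAYERS` and the function names are hypothetical, chosen only for illustration.

```python
import math


def dense_bops(n_in, n_out, bits_act, bits_weight, pruned_fraction=0.0):
    """Approximate BOPs of one fully connected layer (assumed formula)."""
    per_mac = (
        (1.0 - pruned_fraction) * bits_act * bits_weight  # multiplications that survive pruning
        + bits_act
        + bits_weight
        + math.log2(n_in)  # accumulator growth with the number of inputs
    )
    return n_in * n_out * per_mac


def model_bops(layer_sizes, bits, pruned_fraction=0.0):
    """Sum the layer estimates, using one bit width for weights and activations."""
    pairs = zip(layer_sizes[:-1], layer_sizes[1:])
    return sum(dense_bops(n, m, bits, bits, pruned_fraction) for n, m in pairs)


# Hypothetical layer widths, not taken from the text above.
LAYERS = [16, 64, 32, 32, 5]

baseline = model_bops(LAYERS, bits=32)
compressed = model_bops(LAYERS, bits=6, pruned_fraction=0.8)
print(f"BOPs reduction: {baseline / compressed:.1f}x")
```

With these assumed widths, the 6-bit, 80% pruned estimate comes out well over an order of magnitude below the 32-bit, unpruned one, in the same range as the factor quoted in the caption. The exact ratio depends on the real architecture and on how each layer is quantized.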