hls4ml: A Flexible, Open-Source Platform for Deep Learning Acceleration on Reconfigurable Hardware

Jan-Frederik Schulte, Benjamin Ramhorst, Chang Sun, Jovan Mitrevski, Nicolò Ghielmetti, Enrico Lupi, Dimitrios Danopoulos, Vladimir Loncar, Javier Duarte, David Burnette, Lauri Laatu, Stylianos Tzelepis, Konstantinos Axiotis, Quentin Berthet, Haoyan Wang, Paul White, Suleyman Demirsoy, Marco Colombo, Thea Aarrestad, Sioni Summers, Maurizio Pierini, Giuseppe Di Guglielmo, Jennifer Ngadiuba, Javier Campos, Ben Hawks, Abhijith Gandrakota, Farah Fahim, Nhan Tran, George Constantinides, Zhiqiang Que, Wayne Luk, Alexander Tapper, Duc Hoang, Noah Paladino, Philip Harris, Bo-Cheng Lai, Manuel Valentin, Ryan Forelli, Seda Ogrenci, Lino Gerlach, Rian Flynn, Mia Liu, Daniel Diaz, Elham Khoda, Melissa Quinnan, Russell Solares, Santosh Parajuli, Mark Neubauer, Christian Herwig, Ho Fung Tsoi, Dylan Rankin, Shih-Chieh Hsu, Scott Hauck

TL;DR

hls4ml addresses the gap between modern DL frameworks and FPGA/ASIC deployment by translating trained models into HLS-compatible code. Its compiler-inspired workflow combines modular front ends (Keras, PyTorch, ONNX), a unifying IR, optimizer passes, and diverse back ends (Vitis, oneAPI, Catapult) to deliver low-latency, resource-aware hardware designs. The framework supports quantization-aware techniques (QKeras, HGQ), distributed arithmetic, and hardware-aware pruning, enabling rapid co-design of models and hardware across FPGA and ASIC targets. Demonstrations across jet tagging, SVHN, MNIST, and other domains, plus a rich ecosystem of co-design tools and SoC integration, showcase hls4ml as a versatile open-source platform for efficient neural-network acceleration on reconfigurable hardware.

Abstract

We present hls4ml, a free and open-source platform that translates machine learning (ML) models from modern deep learning frameworks into high-level synthesis (HLS) code that can be integrated into full designs for field-programmable gate arrays (FPGAs) or application-specific integrated circuits (ASICs). With its flexible and modular design, hls4ml supports a large number of deep learning frameworks and can target HLS compilers from several vendors, including Vitis HLS, Intel oneAPI, and Catapult HLS. Together with a wider ecosystem for software-hardware co-design, hls4ml has enabled the acceleration of ML inference in a wide range of commercial and scientific applications where low latency, low resource usage, and low power consumption are critical. In this paper, we describe the structure and functionality of the hls4ml platform. The overarching design considerations for the generated HLS code are discussed, together with selected performance results.

Paper Structure

This paper contains 39 sections, 4 figures, and 10 tables.

Figures (4)

  • Figure 1: Model conversion and compilation flow in hls4ml.
  • Figure 2: Comparison of the representation of an example model consisting of a one-layer MLP with a softmax output layer in QKeras (left) and QONNX (right).
  • Figure 3: (a) Illustration of the effect of different RF values for the outer product of two two-vectors. (b) Example CMVM in hls4ml with the Resource strategy. Given a linear layer with $N$ inputs, $M$ outputs and reuse factor RF, there will be $P = \frac{M \cdot N}{\text{RF}}$ multipliers operating in parallel. In each clock cycle, the control logic selects $P$ out of the $N$ inputs and feeds them to the multipliers, with wrap around if $P > N$. The $N \times M$ kernel is reshaped and mapped to on-chip memories such that $P$ elements can be accessed in parallel in each clock cycle. The products are accumulated accordingly at the precision specified to form the output.
  • Figure 4: Schematics of the computation of an MLP model implemented using parallel data transfer (left) and a CNN model implemented using streaming data transfer (right). In the Resource strategy, the number of parallel MAC operations executed in each cycle is determined by the RF and PF. In the case of the MLP, $\frac{M \cdot N}{\text{RF}}$ multiplications are executed in parallel each clock cycle.
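
The reuse-factor arithmetic in the captions of Figures 3 and 4 can be illustrated with a small behavioural model. The sketch below is not hls4ml code and does not reproduce its generated HLS or its on-chip memory layout. The function names `multipliers_in_parallel` and `resource_matvec`, and the row-major ordering of the kernel elements, are assumptions made only for illustration. It shows that $P = \frac{M \cdot N}{\text{RF}}$ products are formed in each of RF cycles, and that the accumulated result matches a fully parallel matrix-vector product.

```python
def multipliers_in_parallel(n_in, n_out, reuse_factor):
    """Return P = (M * N) / RF, the number of multipliers active per cycle."""
    total = n_in * n_out
    if reuse_factor < 1 or total % reuse_factor != 0:
        # Assumption for this sketch: RF must divide M * N evenly.
        raise ValueError("reuse_factor must be a positive divisor of n_in * n_out")
    return total // reuse_factor


def resource_matvec(x, kernel, reuse_factor):
    """Behavioural model of a matrix-vector product computed over RF cycles.

    x      : list of N inputs
    kernel : N x M weights as a list of N rows with M entries each
    """
    n_in, n_out = len(x), len(kernel[0])
    p = multipliers_in_parallel(n_in, n_out, reuse_factor)

    # Flatten the kernel indices. This row-major order is a hypothetical
    # choice for the sketch, not the layout hls4ml actually uses.
    flat = [(i, j) for i in range(n_in) for j in range(n_out)]

    acc = [0] * n_out
    for cycle in range(reuse_factor):
        # In each modelled clock cycle, exactly P products are computed.
        for i, j in flat[cycle * p:(cycle + 1) * p]:
            acc[j] += x[i] * kernel[i][j]
    return acc


if __name__ == "__main__":
    x = [1, 2, 3, 4]
    kernel = [[1, 0, 2], [0, 1, -1], [3, 1, 0], [2, 2, 2]]
    for rf in (1, 2, 3, 4, 6, 12):
        p = multipliers_in_parallel(len(x), len(kernel[0]), rf)
        print(f"RF={rf:2d}  P={p:2d}  output={resource_matvec(x, kernel, rf)}")
```

With RF = 1 all $M \cdot N$ products are computed in a single cycle, while RF = $M \cdot N$ leaves one multiplier that is reused for every product. The output is identical in every case; only the number of multipliers and the number of cycles change.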