wa-hls4ml: A Benchmark and Surrogate Models for hls4ml Resource and Latency Estimation

Benjamin Hawks, Jason Weitz, Dmitri Demler, Karla Tame-Narvaez, Dennis Plotnikov, Mohammad Mehdi Rahimifar, Hamza Ezzaoui Rahali, Audrey C. Therrien, Donovan Sproule, Elham E Khoda, Keegan A. Smith, Russell Marroquin, Giuseppe Di Guglielmo, Nhan Tran, Javier Duarte, Vladimir Loncar

TL;DR

wa-hls4ml tackles the delay in hardware resource and latency estimation for ML accelerators by providing a large-scale open dataset and a standardized benchmark, coupled with surrogate models (GNN and Transformer) that predict FPGA resources and latency directly from ML architectures synthesized with hls4ml. The approach eliminates lengthy C- and logic-synthesis steps in early design iterations, delivering rapid codesign feedback within seconds. The work demonstrates that GNN and Transformer surrogates can predict resources and latency within tight error margins on synthetic test sets and exposes generalization gaps when extrapolating to exemplar real-world models, motivating dataset expansion and model refinements. Overall, wa-hls4ml offers a practical, community-driven resource for accelerating FPGA-based ML deployment and guides future developments in dataset diversity and surrogate modeling for hardware-aware ML design.

Abstract

As machine learning (ML) is increasingly implemented in hardware to address real-time challenges in scientific applications, the development of advanced toolchains has significantly reduced the time required to iterate on various designs. These advancements have solved major obstacles, but also exposed new challenges. For example, processes that were not previously considered bottlenecks, such as hardware synthesis, are becoming limiting factors in the rapid iteration of designs. To mitigate these emerging constraints, multiple efforts have been undertaken to develop ML-based surrogate models that estimate the resource usage of ML accelerator architectures. We introduce wa-hls4ml, a benchmark for ML accelerator resource and latency estimation, and its corresponding initial dataset of over 680,000 fully connected and convolutional neural networks, all synthesized using hls4ml and targeting Xilinx FPGAs. The benchmark evaluates resource and latency predictors on several common ML model architectures, primarily originating from scientific domains, which serve as exemplar models, and also on average performance across a subset of the dataset. Additionally, we introduce GNN- and transformer-based surrogate models that predict latency and resources for ML accelerators. We present the architecture and performance of the models and find that the models generally predict latency and resources for the 75th percentile within several percent of the synthesized resources on the synthetic test dataset.
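One plausible reading of the percentile claim above is that, for each target (a resource count or the latency), the per-sample relative error between the surrogate prediction and the synthesized value is computed, and the 75th percentile of those errors is reported. The following is a minimal sketch under that assumption; the function names and the sample values are hypothetical and are not taken from the paper or from hls4ml.

```python
import math


def relative_errors(predicted, synthesized):
    """Per-sample absolute percentage error.

    Samples whose synthesized (ground-truth) value is zero are skipped,
    because the relative error is undefined there. This skipping rule is
    an assumption of this sketch.
    """
    return [
        abs(p - s) / abs(s) * 100.0
        for p, s in zip(predicted, synthesized)
        if s != 0
    ]


def percentile(values, q):
    """q-th percentile (0 to 100) with linear interpolation between ranks."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100.0
    lo = math.floor(rank)
    hi = math.ceil(rank)
    frac = rank - lo
    return ordered[lo] + (ordered[hi] - ordered[lo]) * frac


if __name__ == "__main__":
    # Hypothetical LUT counts: synthesized ground truth vs. surrogate output.
    synthesized_luts = [1000, 2000, 4000, 8000, 16000]
    predicted_luts = [1010, 1960, 4200, 8080, 15200]

    errors = relative_errors(predicted_luts, synthesized_luts)
    print("75th-percentile relative error (%):", percentile(errors, 75))
```

Reporting a percentile rather than the mean keeps a handful of badly mispredicted designs from dominating the summary, which is presumably why a statistic of this kind is quoted.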

Paper Structure

This paper contains 27 sections, 3 equations, 18 figures, 5 tables.

Figures (18)

  • Figure 1: The traditional codesign workflow compared to the proposed surrogate model based codesign workflow.
  • Figure 2: All tracked output features plotted for each fully-connected model in the dataset versus Bit-Operations, with the color representing the reuse factor.
  • Figure 3: All tracked output features plotted for each convolutional model in the dataset versus Bit-Operations, with the color representing the reuse factor.
  • Figure 4: Resource (label) distributions of the exemplar models versus the train and test subsets.
  • Figure 5: The overall structure of the GNN, comprising five GATv2Conv layers. The vector $(B, L, F)$ comprises the batch size $B$, the number of layers per model $L$, and the number of features per layer $F$.
  • ...and 13 more figures