rule4ml: An Open-Source Tool for Resource Utilization and Latency Estimation for ML Models on FPGA

Mohammad Mehdi Rahimifar, Hamza Ezzaoui Rahali, Audrey C. Therrien

TL;DR

The paper addresses a bottleneck in FPGA development: estimating the resource usage and latency of a neural network normally requires full synthesis. It develops ML-based predictors, trained on a large HLS4ML-generated dataset of fully connected neural networks (FCNNs), that forecast BRAM, DSP, FF, and LUT usage and inference clock cycles before synthesis, enabling rapid feasibility checks. An MLP-based approach with engineered features achieves $R^2$ values between $0.8$ and $0.98$ and sMAPE values between $10\%$ and $30\%$ on the validation set, with latency errors typically under $100$ clock cycles (often below $1\,\mu\text{s}$ at a $10\,\text{ns}$ clock period). The work is released publicly with data and code and aims to accelerate FPGA ML prototyping and planning; the authors plan to extend it to CNNs and RNNs and to incorporate implementation-level reports.
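The link between a latency error in clock cycles and one in microseconds is plain arithmetic: cycles multiplied by the clock period. The sketch below is only an illustration of that conversion; the function name and the default period of 10 ns are assumptions taken from the figures quoted above, not part of the rule4ml code.

```python
def cycles_to_microseconds(cycles: float, clock_period_ns: float = 10.0) -> float:
    """Convert a clock-cycle count to microseconds.

    clock_period_ns is the clock period in nanoseconds (10 ns here,
    i.e. a 100 MHz clock, matching the figure quoted in the summary).
    """
    return cycles * clock_period_ns / 1000.0


# An error of 100 cycles at a 10 ns period corresponds to 1 microsecond.
error_us = cycles_to_microseconds(100)
print(f"{error_us} us")
```

A different target clock only changes `clock_period_ns`; the cycle-count error itself is independent of the clock.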

Abstract

Implementing Machine Learning (ML) models on Field-Programmable Gate Arrays (FPGAs) is becoming increasingly popular across various domains as a low-latency and low-power solution that helps manage large data rates generated by continuously improving detectors. However, developing ML models for FPGAs is time-consuming, as optimization requires synthesis to evaluate FPGA area and latency, making the process slow and repetitive. This paper introduces a novel method to predict the resource utilization and inference latency of Neural Networks (NNs) before their synthesis and implementation on FPGA. We leverage HLS4ML, a tool-flow that helps translate NNs into high-level synthesis (HLS) code, to synthesize a diverse dataset of NN architectures and train resource utilization and inference latency predictors. While HLS4ML requires full synthesis to obtain resource and latency insights, our method uses trained regression models for immediate pre-synthesis predictions. The prediction models estimate the usage of Block RAM (BRAM), Digital Signal Processors (DSP), Flip-Flops (FF), and Look-Up Tables (LUT), as well as the inference clock cycles. The predictors were evaluated on both synthetic and existing benchmark architectures and demonstrated high accuracy with $R^2$ scores ranging between 0.8 and 0.98 on the validation set and sMAPE values between 10% and 30%. Overall, our approach provides valuable preliminary insights, enabling users to quickly assess the feasibility and efficiency of NNs on FPGAs, accelerating the development and deployment processes. The open-source repository can be found at https://github.com/IMPETUS-UdeS/rule4ml, while the datasets are publicly available at https://borealisdata.ca/dataverse/rule4ml.
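To make the two reported metrics concrete, here is a minimal sketch of how $R^2$ and sMAPE can be computed from ground-truth and predicted values. It is not taken from the rule4ml repository. The sMAPE variant used (absolute error divided by the mean of the absolute true and predicted values, ranging from 0% to 200%) is one common definition and is an assumption here; the paper may normalize differently. The sample values are made up for illustration only.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def smape(y_true, y_pred):
    """Symmetric mean absolute percentage error, in percent.

    Assumed variant: |t - p| / ((|t| + |p|) / 2), averaged over samples.
    Pairs where both values are zero contribute zero error.
    """
    terms = []
    for t, p in zip(y_true, y_pred):
        denom = (abs(t) + abs(p)) / 2.0
        terms.append(0.0 if denom == 0 else abs(t - p) / denom)
    return 100.0 * sum(terms) / len(terms)


# Hypothetical LUT counts: synthesis results vs. predictor outputs.
lut_true = [100, 200, 300, 400]
lut_pred = [110, 190, 310, 390]
print(f"R^2   = {r2_score(lut_true, lut_pred):.3f}")
print(f"sMAPE = {smape(lut_true, lut_pred):.2f}%")
```

The two metrics are complementary: $R^2$ measures how much of the variance across architectures the predictor explains, while sMAPE measures the relative size of individual errors, so a model can score well on one and only moderately on the other.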

Paper Structure (11 sections, 7 figures, 7 tables)

This paper contains 11 sections, 7 figures, 7 tables.

Figures (7)

  • Figure 1: HLS4ML workflow.
  • Figure 2: Correlation matrix showing the interdependence between variables.
  • Figure 3: Training and validation loss and metrics. The MAE loss for BRAM, DSP, FF, and LUT is a percentage relative to the board's available resources. Standard MAE is used for clock cycles.
  • Figure 4: Prediction errors illustrated as box plots: (a) resource prediction errors and (b) latency prediction errors. The y-axis is broken so that both the boxes and the outliers are visible.
  • Figure 5: Comparing ground truth (G) and prediction (P) trends of resource utilization across different synthesis parameters on the ZCU102. The values are averaged across the benchmark NNs.
  • ...and 2 more figures