CGRA4ML: A Hardware/Software Framework to Implement Neural Networks for Scientific Edge Computing

G Abarajithan; Zhenghua Ma; Ravidu Munasinghe; Francesco Restuccia; Ryan Kastner

TL;DR

This work tackles the need for high-performance, programmable neural-network inference at the scientific edge, where latency and energy efficiency are paramount and models continue to grow in size. It introduces cgra4ml, an open-source, modular framework that generates parameterizable CGRA hardware (SystemVerilog RTL), a portable C runtime, and automated firmware, all supported by end-to-end verification and vendor-agnostic toolchains for FPGA and ASIC deployment. The key contributions include a GPU-to-hardware deployment workflow via bundles, a dynamic, dataflow-tuned CGRA engine with AXI/DMA integration, and a comprehensive verification suite that ensures 100% bit-parity with a CPU/GPU golden run. The framework enables larger, sub-8-bit quantized models to run efficiently on both FPGA and ASIC backends, with demonstrated deployments on ResNet-50, PointNet, and Ibex-based SoCs, offering a practical path from model concept to silicon-ready deployment for scientific computing at the edge.

Abstract

The scientific community increasingly relies on machine learning (ML) for near-sensor processing, leveraging its strengths in tasks such as pattern recognition, anomaly detection, and real-time decision-making. These deployments demand accelerators that combine extremely high performance with programmability, ease of integration, and straightforward verification. We present cgra4ml, an open-source, modular framework that generates parameterizable CGRA accelerators in synthesizable SystemVerilog RTL, tailored to common ML compute patterns found in scientific applications. The framework supports seamless system integration through AXI-compliant interfaces and open-source DMA components, and it includes automatic firmware generation for programming the accelerator. A comprehensive verification suite and a runtime firmware stack further support deployment across diverse SoC platforms. cgra4ml provides a modular, full-stack infrastructure, including a Python API, SystemVerilog hardware, TCL toolflows, and a C runtime, which facilitates easy integration and experimentation, allowing scientists to focus on innovation rather than on the intricacies of hardware design and optimization. We demonstrate the effectiveness of cgra4ml in implementing common scientific edge neural networks using ASIC and FPGA design flows.

Paper Structure (41 sections, 6 equations, 14 figures, 6 tables)

Introduction
Background and Related Work
Neural Networks for Scientific Computing
Dataflow-style NN Implementation: hls4ml, finn
DNN Accelerators
CGRA
ML to FPGA/ASIC Frontend Frameworks
Motivation for cgra4ml
cgra4ml Overview
Model Definition and Training
Accelerator Configuration
RTL Generation
Firmware Generation
Verification
FPGA/ASIC Deployment
...and 26 more sections

Figures (14)

Figure 1: Data processing pipeline at the Large Hadron Collider, CERN. The detectors at the LHC produce data at a rate of 320 Tb/s during collisions. Because transferring data at such a high rate is infeasible, the events are filtered through a multi-step pipeline. Small neural networks like autoencoders are implemented using hls4ml on radiation-hardened ASICs. More sophisticated models are implemented on FPGAs (L1) and servers (L2) to filter the data fully. With cgra4ml enabling larger models at low power, more processing can be moved to the edge (towards the left), resulting in more robust filtering.
Figure 2: Layer-by-layer dataflow-style implementation of neural networks using hls4ml. hls4ml offers multiple backends, one per family of devices. Each backend is a collection of layers defined as templated HLS. Each neural network layer is implemented as a separate datapath. Xilinx DMA IPs are connected to the hls4ml design to move data in and out. finn has a similar dataflow-style implementation. In contrast, cgra4ml reuses the same CGRA multiple times to process a layer.
Figure 3: Positioning cgra4ml in the space of ML-to-FPGA frameworks from the perspective of the scientific computing community. hls4ml is the most popular tool in the scientific computing community, as it is user-friendly and supports the dataflow-style implementation of models of varying bitwidths and layers to an extent. finn implements models in a similar dataflow fashion, but excels at very low bitwidths. Traditional AI accelerators are suitable for large models with 8+ bit quantization. cgra4ml aims to fill the gap by targeting models with sub-8-bit quantization that are too big to implement with hls4ml and finn.
Figure 4: cgra4ml workflow as outlined in Sec. \ref{sec:overview}. Users first build quantized neural networks and train them in a quantization-aware manner using qkeras (Sec. \ref{subsec:frontend}). Users then write a CGRA definition, from which the framework generates vendor-agnostic SystemVerilog RTL hardware specifications and TCL tool flows (Sec. \ref{subsec:frontend}). The model can then be exported to generate weights and a C runtime firmware (Sec. \ref{sec:firmware}). The generated hardware IP is then verified comprehensively with the model and firmware (Sec. \ref{subsec:verf}) using our randomized, transactional SystemVerilog testbench suite with DPI-C extensions. Finally, the bitstream generated from the FPGA toolflow and the C firmware can be loaded into an FPGA to be tested in seconds (Sec. \ref{sec:fpga}). After such rapid prototyping, the same hardware design can be moved to ASIC (Sec. \ref{sec:asic}).
Figure 5: Bundle: A given DNN is decomposed into a list of bundles, where a bundle is a group of layers that can be deterministically executed by the system generated by cgra4ml. The CGRA accelerates the simple but compute-heavy operations, while the CPU executes the complex but lightweight pixel-wise operations. This flexibility allows us to add new, complex operations easily.
...and 9 more figures
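
The bundle idea in Figure 5 can be illustrated with a small sketch. This is not the cgra4ml API: the op names, the `Bundle` class, the `decompose` function, and the grouping rule (one compute-heavy op followed by any trailing lightweight ops) are all assumptions made for illustration. It only shows how a flat list of layers might be split so that the CGRA handles the heavy operation and the CPU handles the pixel-wise operations that follow it.

```python
from dataclasses import dataclass, field

# Hypothetical op categories, chosen for illustration only.
CGRA_OPS = {"conv2d", "dense"}  # simple but compute-heavy
CPU_OPS = {"relu", "add", "maxpool", "avgpool", "softmax", "flatten"}  # lightweight


@dataclass
class Bundle:
    core: str                                  # op assumed to run on the CGRA
    post: list = field(default_factory=list)   # ops assumed to run on the CPU


def decompose(layers):
    """Group a flat layer list into bundles: one CGRA op plus trailing CPU ops."""
    bundles = []
    for op in layers:
        if op in CGRA_OPS:
            # Every compute-heavy op opens a new bundle.
            bundles.append(Bundle(core=op))
        elif op in CPU_OPS:
            if not bundles:
                raise ValueError(f"{op!r} appears before any CGRA op")
            # Lightweight ops attach to the most recent bundle.
            bundles[-1].post.append(op)
        else:
            raise ValueError(f"unsupported op: {op!r}")
    return bundles


model = ["conv2d", "relu", "maxpool", "conv2d", "relu", "flatten", "dense", "softmax"]
bundles = decompose(model)
for i, b in enumerate(bundles):
    print(i, b.core, b.post)
```

Under this assumed rule, supporting a new pixel-wise operation only means adding it to the CPU-side set, which mirrors the flexibility the caption describes.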
