CGRA4ML: A Hardware/Software Framework to Implement Neural Networks for Scientific Edge Computing
G Abarajithan, Zhenghua Ma, Ravidu Munasinghe, Francesco Restuccia, Ryan Kastner
TL;DR
This work tackles the need for high-performance, programmable neural-network inference at the scientific edge, where latency and energy efficiency are paramount and models continue to grow in size. It introduces cgra4ml, an open-source, modular framework that generates parameterizable CGRA hardware (SystemVerilog RTL), a portable C runtime, and automated firmware, all supported by end-to-end verification and vendor-agnostic toolchains for FPGA and ASIC deployment. The key contributions include a GPU-to-hardware deployment workflow via bundles, a dynamic, dataflow-tuned CGRA engine with AXI/DMA integration, and a comprehensive verification suite that ensures 100% bit-parity with a CPU/GPU golden run. The framework enables larger, sub-8-bit quantized models to run efficiently on both FPGA and ASIC backends, with demonstrated deployments on ResNet-50, PointNet, and Ibex-based SoCs, offering a practical path from model concept to silicon-ready deployment for scientific computing at the edge.
Abstract
The scientific community increasingly relies on machine learning (ML) for near-sensor processing, leveraging its strengths in tasks such as pattern recognition, anomaly detection, and real-time decision-making. These deployments demand accelerators that combine extremely high performance with programmability, ease of integration, and straightforward verification. We present cgra4ml, an open-source, modular framework that generates parameterizable CGRA accelerators in synthesizable SystemVerilog RTL, tailored to common ML compute patterns found in scientific applications. The framework supports seamless system integration through AXI-compliant interfaces and open-source DMA components, and it includes automatic firmware generation for programming the accelerator. A comprehensive verification suite and a runtime firmware stack further support deployment across diverse SoC platforms. cgra4ml provides a modular, full-stack infrastructure comprising a Python API, SystemVerilog hardware, TCL toolflows, and a C runtime. This eases integration and experimentation, letting scientists focus on innovation rather than on the intricacies of hardware design and optimization. We demonstrate the effectiveness of cgra4ml in implementing common scientific edge neural networks using ASIC and FPGA design flows.
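To make the idea of bit-exact verification concrete, the following is a minimal sketch in Python with NumPy. It is not cgra4ml's API or its verification suite: the function names (`quantize`, `golden_dense`, `tiled_dense`), the 4-bit fixed-point format, and the tile size are all assumptions chosen for illustration. It compares a reference integer matrix product against a tiled accumulation that loosely mimics how an array of processing elements might consume the reduction dimension in chunks.

```python
import numpy as np


def quantize(x, bits, frac_bits):
    """Map floats to signed fixed-point integers (hypothetical format)."""
    scale = 1 << frac_bits
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return np.clip(np.round(x * scale), lo, hi).astype(np.int32)


def golden_dense(x_q, w_q):
    """Reference integer matrix product, standing in for a CPU/GPU golden run."""
    return x_q.astype(np.int64) @ w_q.astype(np.int64)


def tiled_dense(x_q, w_q, tile=4):
    """Same product, accumulated over tiles of the reduction dimension."""
    acc = np.zeros((x_q.shape[0], w_q.shape[1]), dtype=np.int64)
    for k in range(0, x_q.shape[1], tile):
        x_tile = x_q[:, k:k + tile].astype(np.int64)
        w_tile = w_q[k:k + tile, :].astype(np.int64)
        acc += x_tile @ w_tile
    return acc


rng = np.random.default_rng(0)
# Sub-8-bit operands: 4 bits total, 3 fractional bits (an assumed format).
x_q = quantize(rng.uniform(-1, 1, (8, 16)), bits=4, frac_bits=3)
w_q = quantize(rng.uniform(-1, 1, (16, 5)), bits=4, frac_bits=3)

expected = golden_dense(x_q, w_q)
actual = tiled_dense(x_q, w_q)

# Bit parity means every output integer matches exactly, with no tolerance.
bit_exact = np.array_equal(expected, actual)
```

Because both paths use integer arithmetic, the comparison uses exact equality rather than a floating-point tolerance, which is what a bit-parity check requires.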
