MING: An Automated CNN-to-Edge MLIR HLS framework
Jiahong Bi, Lars Schütze, Jeronimo Castrillon
TL;DR
MING presents a streaming, MLIR-based framework for edge-targeted HLS design that directly addresses the resource constraints typical of FPGA edge devices. It classifies kernels into three types (pure parallel, regular reduction, and sliding-window) and builds a fully streaming dataflow with line buffers, which eliminates large intermediate tensors and reduces on-chip memory usage. A hardware-aware optimization flow is coupled with an automatic design-space exploration that minimizes cycle count under DSP, BRAM, and stream constraints. Evaluated on a Kria KV260 with 8-bit quantized kernels, MING achieves large speedups over the baseline and prior streaming frameworks (on average 15x for CNN kernels with up to four layers, and up to 200x for single-layer kernels) while using fewer resources. The work points toward integration with complementary tools to cover broader model families and architectures.
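To make the line-buffer idea concrete, here is a minimal Python sketch of a sliding-window kernel that consumes a single-channel image one pixel at a time in raster order. It models the general technique only and is not MING's generated HLS code; the function name `stream_conv2d`, the single-channel layout, and the valid (unpadded) convolution are assumptions for illustration. The kernel keeps only K-1 rows of width W plus a K x K window, instead of the full H x W input tensor.

```python
from typing import Iterable, Iterator, List


def stream_conv2d(pixels: Iterable[int], width: int,
                  kernel: List[List[int]]) -> Iterator[int]:
    """Hypothetical streaming valid 2D cross-correlation over a raster-order stream."""
    k = len(kernel)
    # Line buffer: K-1 rows of `width` values. Before the update for row r,
    # line_buf[i][c] holds pixel (r - (K-1) + i, c), so row 0 is the oldest.
    line_buf = [[0] * width for _ in range(k - 1)]
    # K x K window register; shifted left by one column per incoming pixel.
    window = [[0] * k for _ in range(k)]
    for idx, px in enumerate(pixels):
        r, c = divmod(idx, width)
        # Column of K vertically adjacent pixels ending at the current one.
        col = [line_buf[i][c] for i in range(k - 1)] + [px]
        # Shift this column of the line buffer up and store the new pixel.
        for i in range(k - 2):
            line_buf[i][c] = line_buf[i + 1][c]
        if k > 1:
            line_buf[k - 2][c] = px
        # Slide the window: drop the oldest column, append the new one.
        for i in range(k):
            window[i] = window[i][1:] + [col[i]]
        # Emit only once a full K x K neighbourhood of real pixels is present.
        if r >= k - 1 and c >= k - 1:
            yield sum(window[i][j] * kernel[i][j]
                      for i in range(k) for j in range(k))
```

For example, `list(stream_conv2d(range(16), 4, [[1] * 3] * 3))` returns `[45, 54, 81, 90]`, the sums of the four 3x3 neighbourhoods of a 4x4 image holding 0..15. In an HLS design the line buffer would typically map to BRAM and the window to registers, which is where the on-chip memory saving over a materialized intermediate tensor comes from.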
Abstract
Driven by the increasing demand for low-latency, real-time processing, machine learning applications are steadily migrating toward edge computing platforms, where Field-Programmable Gate Arrays (FPGAs) are widely adopted for their energy efficiency compared to CPUs and GPUs. To generate high-performance, low-power FPGA designs, several frameworks built on vendor High-Level Synthesis (HLS) tools have been proposed; among them, frameworks based on the Multi-Level Intermediate Representation (MLIR) are gaining significant traction due to their extensibility and ease of use. However, existing state-of-the-art frameworks often overlook the stringent resource constraints of edge devices. To address this limitation, we propose MING, an MLIR-based framework that abstracts and automates the HLS design process. Within this framework, we adopt a streaming architecture with carefully managed buffers, designed specifically to handle resource constraints while ensuring low latency. Compared with recent frameworks, our approach achieves an average 15x speedup for standard Convolutional Neural Network (CNN) kernels with up to four layers, and up to 200x for single-layer kernels. For kernels with larger input sizes, MING generates efficient designs that respect the hardware resource constraints, which state-of-the-art frameworks struggle to meet.
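The resource-constrained design-space exploration can be illustrated with a deliberately small brute-force search. The sketch below uses a hypothetical cost model, not MING's: each layer gets a parallelism factor p, takes ceil(MACs / p) cycles, uses one DSP per parallel 8-bit MAC, and needs at least p BRAM banks (or enough 36Kb blocks to hold its weights). Because the layers run as a streaming pipeline, the slowest stage is taken as the cycle count to minimize under the DSP and BRAM budgets. The names `Layer`, `layer_cost`, and `explore` are illustrative only.

```python
import itertools
import math
from dataclasses import dataclass
from typing import List, Optional, Tuple

BRAM_BYTES = 4608  # one 36Kb block RAM holds 36864 bits = 4608 bytes


@dataclass(frozen=True)
class Layer:
    name: str
    macs: int          # multiply-accumulates per inference
    weight_bytes: int  # 8-bit quantized weights


def layer_cost(layer: Layer, p: int) -> Tuple[int, int, int]:
    """Hypothetical (cycles, DSPs, BRAMs) for parallelism factor p."""
    cycles = math.ceil(layer.macs / p)
    dsp = p  # assumption: one DSP per parallel MAC
    # Assumption: feeding p MACs per cycle needs the weights split into p banks.
    bram = max(p, math.ceil(layer.weight_bytes / BRAM_BYTES))
    return cycles, dsp, bram


def explore(layers: List[Layer], factors: Tuple[int, ...], dsp_budget: int,
            bram_budget: int) -> Optional[Tuple[int, Tuple[int, ...]]]:
    """Return (bottleneck cycles, per-layer factors), or None if infeasible."""
    best_key, best = None, None
    for combo in itertools.product(factors, repeat=len(layers)):
        costs = [layer_cost(layer, p) for layer, p in zip(layers, combo)]
        dsp = sum(cost[1] for cost in costs)
        bram = sum(cost[2] for cost in costs)
        if dsp > dsp_budget or bram > bram_budget:
            continue  # violates a resource budget
        # In a streaming dataflow the slowest stage bounds throughput.
        cycles = max(cost[0] for cost in costs)
        key = (cycles, dsp)  # fewest cycles first, then fewest DSPs
        if best_key is None or key < best_key:
            best_key, best = key, (cycles, combo)
    return best
```

For two layers with 1024 and 4096 MACs (100 bytes of weights each), factors `(1, 2, 4, 8, 16)`, and budgets of 20 DSPs and 20 BRAMs, `explore` returns `(256, (4, 16))`: the larger layer receives more parallelism so that both stages finish in 256 cycles. Exhaustive enumeration grows exponentially with the number of layers, so a practical tool would need a smarter search; the sketch only shows the objective and the constraints.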
