Stream-HLS: Towards Automatic Dataflow Acceleration
Suhail Basalama, Jason Cong
TL;DR
Stream-HLS is an MLIR-based framework that automatically transforms sequential multi-kernel software into optimized dataflow accelerators for FPGAs. It combines a dataflow-centric canonicalization and FIFO-sharing strategy with an accurate analytical performance model and MINLP-based global design space exploration, jointly optimizing node-level pipelining, graph-level pipelining, and node-level parallelization. Across diverse benchmarks, including transformer, CNN, and MLP workloads, the generated designs outperform Vitis HLS baselines, prior automation frameworks, and manually optimized designs from abstraction frameworks, with geometric-mean speedups of up to $79.43\times$ over the automation frameworks and $10.62\times$ over the manually optimized designs. These gains come from balancing resources across kernels and exploiting streaming data paths. The framework is open-source and modular, so it can be extended to future workloads and other complex multi-kernel applications.
Abstract
High-level synthesis (HLS) has enabled the rapid development of custom hardware circuits for many software applications. However, developing high-performance hardware circuits using HLS is still a non-trivial task that requires expertise in hardware design. Further, the hardware design space grows exponentially, especially for multi-kernel applications. Several HLS automation and abstraction frameworks have therefore been proposed recently, but many issues remain unresolved. These issues include 1) relying mainly on hardware directives (pragmas) to apply hardware optimizations without exploring loop scheduling opportunities; 2) targeting single-kernel applications only; 3) lacking automatic and/or global design space exploration; and 4) missing critical hardware optimizations, such as graph-level pipelining for multi-kernel applications. To address these challenges, we propose Stream-HLS, a novel methodology and framework built on top of the popular multi-level intermediate representation (MLIR) infrastructure. Our framework takes C/C++ or PyTorch software code and automatically generates an optimized dataflow architecture, along with host code, for field-programmable gate arrays (FPGAs). To achieve this, we developed an accurate analytical performance model for global scheduling and optimization of dataflow architectures. Stream-HLS is evaluated using various standard HLS benchmarks and real-world benchmarks from transformer models, convolutional neural networks, and multilayer perceptrons. Stream-HLS designs outperform the designs of prior state-of-the-art automation frameworks and the manually optimized designs of abstraction frameworks by geometric means of up to $79.43\times$ and $10.62\times$, respectively. Finally, the Stream-HLS framework is modular, extensible, and open-sourced at \url{https://github.com/UCLA-VAST/Stream-HLS} (\url{https://doi.org/10.5281/zenodo.14585909}).
