Stream: Design Space Exploration of Layer-Fused DNNs on Heterogeneous Dataflow Accelerators
Arne Symons, Linyan Mei, Steven Colleman, Pouya Houshmand, Sebastian Karl, Marian Verhelst
TL;DR
This work addresses the challenge of efficiently deploying ever-larger DNNs on edge hardware by introducing Stream, a design space exploration framework that co-explores layer fusion and heterogeneous dataflow accelerators (HDAs). It combines fine-grained workload graph generation (CTs) with a memory- and communication-aware latency model (COALA) and a constraint-based allocator/scheduler (WACO) to optimize across core allocations, dataflows, and fusion depth. The framework is validated against three state-of-the-art HDAs, showing up to 2.2x lower energy-delay product than layer-by-layer scheduling and 32% gains over genetic algorithm (GA)-based approaches, while enabling broad architectural exploration under iso-area constraints. Stream is open-source and designed to adapt to evolving accelerator designs, offering a practical, scalable path to efficient edge inference with fused DNNs.
Abstract
As the landscape of deep neural networks evolves, heterogeneous dataflow accelerators, in the form of multi-core architectures or chiplet-based designs, promise more flexibility and higher inference performance through scalability. So far, these systems exploit the increased parallelism either by coarsely mapping a single layer at a time across cores, which incurs frequent, costly off-chip memory accesses, or by pipelining batches of inputs, which falls short of the demands of latency-critical applications. To alleviate these bottlenecks, this work explores a new fine-grained mapping paradigm, referred to as layer fusion, on heterogeneous dataflow accelerators through a novel design space exploration framework called Stream. Stream captures a wide variety of heterogeneous dataflow architectures and mapping granularities, and implements a memory- and communication-aware latency and energy analysis validated against three distinct state-of-the-art hardware implementations. As such, it facilitates a holistic exploration of architecture and mapping by strategically allocating the workload through constraint optimization. The findings demonstrate that combining layer fusion with heterogeneous dataflow accelerators yields up to 2.2x lower energy-delay product, addressing both energy consumption and latency concerns. The framework is available open-source at: https://github.com/kuleuven-micas/stream.
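The headline metric above, energy-delay product (EDP), is the product of the energy and the latency of an inference, so a lower value is better. The sketch below shows how an "N-times lower EDP" figure could be computed when comparing two schedules. It is only an illustration: the function names are hypothetical and not part of the Stream codebase, and the energy and latency values are made up for the example rather than taken from the paper.

```python
def energy_delay_product(energy_mj: float, latency_ms: float) -> float:
    """Return the energy-delay product (lower is better)."""
    return energy_mj * latency_ms


def edp_reduction(baseline: tuple[float, float],
                  candidate: tuple[float, float]) -> float:
    """Return how many times lower the candidate's EDP is than the baseline's.

    Each argument is an (energy_mj, latency_ms) pair.
    """
    return energy_delay_product(*baseline) / energy_delay_product(*candidate)


# Hypothetical numbers, chosen only to illustrate the arithmetic.
layer_by_layer = (3.0, 2.0)  # (energy in mJ, latency in ms)
layer_fused = (2.0, 1.5)

reduction = edp_reduction(layer_by_layer, layer_fused)
print(f"EDP reduction: {reduction:.1f}x")
```

Because EDP multiplies the two quantities, a schedule can lower it by saving energy, by reducing latency, or both, which is why the metric reflects energy and latency concerns together.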
