Stream: Design Space Exploration of Layer-Fused DNNs on Heterogeneous Dataflow Accelerators

Arne Symons; Linyan Mei; Steven Colleman; Pouya Houshmand; Sebastian Karl; Marian Verhelst

TL;DR

Stream is a design space exploration framework that co-explores layer fusion and heterogeneous dataflow accelerators (HDAs) to deploy ever-larger DNNs efficiently on edge hardware. It combines fine-grained workload graph generation based on Computation Tiles (CTs) with a memory- and communication-aware latency model (COALA) and a constraint-based allocator/scheduler (WACO) to optimize core allocation, dataflow, and fusion depth. The framework's modeling is validated against three state-of-the-art HDA implementations. It achieves up to $2.2\times$ lower energy-delay product than layer-by-layer scheduling and $32\%$ gains over GA-based approaches, and it supports broad architectural exploration under iso-area constraints. Stream is open-source and designed to adapt to evolving accelerator designs, offering a practical, scalable path to efficient edge inference with fused DNNs.

Abstract

As the landscape of deep neural networks evolves, heterogeneous dataflow accelerators, in the form of multi-core architectures or chiplet-based designs, promise more flexibility and higher inference performance through scalability. So far, these systems exploit the increased parallelism either by coarsely mapping a single layer at a time across cores, which incurs frequent and costly off-chip memory accesses, or by pipelining batches of inputs, which falls short of the demands of latency-critical applications. To alleviate these bottlenecks, this work explores a new fine-grained mapping paradigm, referred to as layer fusion, on heterogeneous dataflow accelerators through a novel design space exploration framework called Stream. Stream captures a wide variety of heterogeneous dataflow architectures and mapping granularities, and implements a memory- and communication-aware latency and energy analysis validated against three distinct state-of-the-art hardware implementations. As such, it facilitates a holistic exploration of architecture and mapping by strategically allocating the workload through constraint optimization. The findings demonstrate that combining layer fusion with heterogeneous dataflow accelerators yields up to 2.2x lower energy-delay product, addressing both energy consumption and latency concerns. The framework is available open-source at: https://github.com/kuleuven-micas/stream.
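The headline metric above, energy-delay product (EDP), is the product of energy and latency, so a lower value is better. The following minimal sketch shows how an EDP reduction factor between two schedules could be computed. The function name and all numbers are hypothetical placeholders chosen for illustration; they are not results from the paper and not part of Stream's API.

```python
def energy_delay_product(energy_j: float, latency_s: float) -> float:
    """Return the energy-delay product (joule-seconds); lower is better."""
    return energy_j * latency_s


# Hypothetical energy/latency figures for two schedules of the same workload.
# These values are illustrative assumptions, not measurements from the paper.
layer_by_layer_edp = energy_delay_product(energy_j=3.0e-3, latency_s=12.0e-3)
layer_fused_edp = energy_delay_product(energy_j=2.0e-3, latency_s=9.0e-3)

# Reduction factor: how many times lower the fused schedule's EDP is.
edp_reduction = layer_by_layer_edp / layer_fused_edp
print(f"EDP reduction: {edp_reduction:.2f}x")
```

Because EDP multiplies the two quantities, a schedule can improve it by lowering energy, latency, or both, which is why it suits a comparison that cares about both concerns at once.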

Paper Structure (40 sections, 9 equations, 16 figures, 3 tables)

Introduction
Dataflow Accelerator Architectures
Fine-Grained Scheduling Strategies
Contributions of this Work
Background & Related Works
Dataflow Accelerator Architectures
Workload Allocation, Scheduling & Mapping
Allocation
Scheduling
Mapping
Stream framework
Input: ONNX Workload
Input: HDA Architecture
Fine-grain Workload Graph Generation
...and 25 more sections

Figures (16)

Figure 1: A conceptual example showing different ways of scheduling a deep neural network workload onto different hardware accelerators.
Figure 2: Architectural schematics of dataflow accelerators: (a) Heterogeneous Dataflow Accelerator (HDA), (b) Weight Stationary (WS) core, and (c) Output Stationary (OS) core.
Figure 3: Overview of the Stream framework.
Figure 4: Stream models a graph of cores, with CommunicationLink objects attached to its edges, to represent a wide range of HDA architectures.
Figure 5: Each layer is partitioned into one or more Computation Tiles (CT) depending on the desired granularity.
...and 11 more figures
