Automated Deep Neural Network Inference Partitioning for Distributed Embedded Systems

Fabian Kreß, El Mahdi El Annabi, Tim Hotfilter, Julian Hoefer, Tanja Harbaum, Juergen Becker

TL;DR

This work tackles the challenge of efficiently mapping the inference of large DNNs onto distributed embedded systems with multiple accelerators. It introduces a graph-based, hardware-aware design space exploration framework that automatically identifies partitioning points, filters candidates by memory and link constraints, evaluates accuracy under quantization, estimates performance on the target hardware, and selects Pareto-optimal solutions via NSGA-II. The approach is validated on six CNNs, showing substantial throughput improvements (e.g., up to 47.5 % for EfficientNet-B0) and revealing memory and energy trade-offs as the partitioning point varies; for large architectures, partitioning across more accelerators becomes increasingly beneficial. Overall, the framework demonstrates the value of holistic hardware/software co-design for energy-efficient, high-throughput inference in distributed embedded systems such as those used in robotics and autonomous driving.

Abstract

Distributed systems can be found in various applications, e.g., in robotics or autonomous driving, where they provide higher flexibility and robustness. In such systems, data-flow-centric applications such as Deep Neural Network (DNN) inference benefit in terms of performance and energy efficiency from partitioning the workload over multiple compute nodes. However, mapping large models onto distributed embedded systems is a complex task, due to low-latency and high-throughput requirements combined with strict energy and memory constraints. In this paper, we present a novel approach for hardware-aware layer scheduling of DNN inference in distributed embedded systems. To this end, our proposed framework uses a graph-based algorithm to automatically find beneficial partitioning points in a given DNN. Each of these is evaluated based on several essential system metrics such as accuracy and memory utilization, while considering the respective system constraints. We demonstrate our approach by analyzing the impact of inference partitioning on various performance metrics of six different DNNs. As an example, we achieve a 47.5 % throughput increase for EfficientNet-B0 inference partitioned onto two platforms while observing high energy efficiency.
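To make the idea of graph-based partitioning points concrete, here is a minimal Python sketch. It is not the authors' algorithm: it assumes that a partitioning point is a layer after which exactly one tensor (that layer's output) crosses the cut, and it checks memory using weight sizes only. The function names, the toy graph, and the byte counts are all hypothetical.

```python
from collections import defaultdict


def partitioning_points(layers, edges):
    """Return layers after which only that layer's output crosses the cut.

    layers: layer names in topological order (assumed input format).
    edges:  (producer, consumer) pairs describing the data flow.
    """
    position = {name: i for i, name in enumerate(layers)}
    last_use = defaultdict(int)  # producer -> position of its last consumer
    for src, dst in edges:
        last_use[src] = max(last_use[src], position[dst])

    points = []
    live_until = 0  # furthest position at which an earlier output is still needed
    for i, name in enumerate(layers[:-1]):
        # Valid cut: no output of an earlier layer is consumed after layer i.
        if live_until <= i:
            points.append(name)
        live_until = max(live_until, last_use[name])
    return points


def feasible(points, layers, weight_bytes, mem_a, mem_b):
    """Keep cut points whose two halves fit the (hypothetical) memory budgets."""
    kept = []
    for p in points:
        k = layers.index(p) + 1
        left = sum(weight_bytes[l] for l in layers[:k])
        right = sum(weight_bytes[l] for l in layers[k:])
        if left <= mem_a and right <= mem_b:
            kept.append(p)
    return kept


# Toy residual block: the skip connection a -> add forbids cuts after b and c.
layers = ["a", "b", "c", "add", "fc"]
edges = [("a", "b"), ("b", "c"), ("c", "add"), ("a", "add"), ("add", "fc")]
weights = {"a": 10, "b": 20, "c": 20, "add": 0, "fc": 50}  # made-up sizes

candidates = partitioning_points(layers, edges)
print(candidates)
print(feasible(candidates, layers, weights, mem_a=40, mem_b=100))
```

In this toy graph the skip connection keeps the output of `a` alive until `add`, so only `a` and `add` qualify as cut points. A link constraint could be handled the same way by comparing the size of the crossing tensor against a bandwidth budget.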

Paper Structure (13 sections, 4 equations, 3 figures, 2 tables)

Figures (3)

  • Figure 1: Overview of our proposed framework. First, a graph is generated based on the DNN description. After potential partitioning points are filtered according to memory and link constraints, quantization is performed and evaluated. Finally, the framework estimates performance on hardware and selects a Pareto-optimal point.
  • Figure 2: Selected exploration results for a system consisting of an Eyeriss-like (EYR) accelerator in platform A and a Simba-like (SMB) accelerator in platform B linked via Gigabit Ethernet. The orange and blue squares mark the cases in which the inference is performed either completely on platform A or B, triangles highlight beneficial solutions.
  • Figure 3: Results of the memory resource analysis for EfficientNet-B0 on a system consisting of two 16-bit platform architectures A and B.
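
The Pareto-optimal selection mentioned in the Figure 1 caption can be illustrated with a small Python sketch. It uses a brute-force non-dominated filter rather than NSGA-II, which is only practical for small candidate sets, and it assumes all objectives are minimized. The candidate names and the (latency, energy) values are made up for illustration and are not results from the paper.

```python
def dominates(a, b):
    """True if a is no worse than b in every objective and strictly better in one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(candidates):
    """Return the names of all non-dominated candidates, sorted alphabetically.

    candidates: dict mapping a name to a tuple of objectives to minimize.
    """
    return sorted(
        name
        for name, objectives in candidates.items()
        if not any(dominates(other, objectives) for other in candidates.values())
    )


# Hypothetical (latency in ms, energy in mJ) per mapping; values are invented.
candidates = {
    "all_on_A": (10.0, 5.0),
    "all_on_B": (6.0, 9.0),
    "cut_after_conv3": (5.0, 6.0),
    "cut_after_conv7": (7.0, 7.0),
}

print(pareto_front(candidates))
```

Here `cut_after_conv3` dominates both `all_on_B` and `cut_after_conv7`, while `all_on_A` survives because it has the lowest energy. An evolutionary method such as NSGA-II becomes relevant when the design space is too large to enumerate in this way.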

Theorems & Definitions (4)

  • Definition 1: Partitioning Point
  • Definition 2: Minimization Problem
  • Definition 3: Memory Size
  • Definition 4: Throughput