Shared Memory-contention-aware Concurrent DNN Execution for Diversely Heterogeneous System-on-Chips

Ismet Dagli; Mehmet Belviranli

TL;DR

HaX-CoNN addresses concurrent execution of multiple DNN inference workloads on heterogeneous SoCs whose accelerators (DSAs) share memory. It combines layer-centric profiling with contention-aware scheduling in a two-part approach: (i) separate characterization of per-layer performance and inter-DSA transition costs, plus a PCCS-based memory contention model, and (ii) a SAT-solver-based optimizer that yields optimal layer-to-DSA mappings under either a throughput or a latency objective. A dynamic variant, D-HaX-CoNN, adapts schedules at run time as workloads change. Evaluated on NVIDIA AGX Orin, NVIDIA AGX Xavier, and Qualcomm Snapdragon 865, HaX-CoNN improves latency by up to 32% and throughput by up to 29%, and reduces shared memory contention by up to 45%, which makes it relevant to edge and robotic systems that run multiple DNN tasks concurrently.

Abstract

Two distinguishing features of state-of-the-art mobile and autonomous systems are that 1) they often run multiple workloads, mainly deep neural network (DNN) inference, concurrently and continuously; and 2) they operate on shared memory system-on-chips (SoCs) that embed heterogeneous accelerators tailored for specific operations. The state of the art lacks the efficient performance and resource management techniques necessary to either maximize total system throughput or minimize end-to-end workload latency. In this work, we propose HaX-CoNN, a novel scheme that characterizes and maps layers in concurrently executing DNN inference workloads to a diverse set of accelerators within an SoC. Our scheme uniquely takes per-layer execution characteristics, shared memory (SM) contention, and inter-accelerator transitions into account to find optimal schedules. We evaluate HaX-CoNN on NVIDIA Orin, NVIDIA Xavier, and Qualcomm Snapdragon 865 SoCs. Our experimental results indicate that HaX-CoNN reduces memory contention by up to 45% and can improve latency and total throughput by up to 32% and 29%, respectively, compared to state-of-the-art approaches.
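
The mapping problem described above can be illustrated with a small toy sketch. It is not the authors' SAT-based formulation: it uses exhaustive search, a simplified lockstep execution model (group i of each DNN runs at the same time), and made-up numbers. The names `PROFILE`, `TRANSITION_MS`, `ALPHA`, `schedule_cost`, and `best_schedule` are all hypothetical and exist only for this illustration.

```python
from itertools import product

ACCELERATORS = ("GPU", "DLA")

# Hypothetical profile (made-up numbers, not measurements): for each layer
# group, the standalone time in ms per accelerator and a memory-demand score.
PROFILE = {
    "dnn_a": [
        {"time": {"GPU": 2.0, "DLA": 3.0}, "mem": 0.8},
        {"time": {"GPU": 4.0, "DLA": 4.5}, "mem": 0.2},
    ],
    "dnn_b": [
        {"time": {"GPU": 3.0, "DLA": 5.0}, "mem": 0.6},
        {"time": {"GPU": 1.0, "DLA": 2.5}, "mem": 0.4},
    ],
}
TRANSITION_MS = 0.5  # assumed cost of moving execution to another accelerator
ALPHA = 0.5          # assumed sensitivity to the co-runner's memory demand


def schedule_cost(assign_a, assign_b):
    """Return the makespan (ms) of a lockstep schedule, or None if infeasible."""
    total = 0.0
    for step, (acc_a, acc_b) in enumerate(zip(assign_a, assign_b)):
        if acc_a == acc_b:
            return None  # an accelerator runs only one layer group at a time
        grp_a = PROFILE["dnn_a"][step]
        grp_b = PROFILE["dnn_b"][step]
        # Toy contention model: each group slows down in proportion to the
        # memory demand of the group running concurrently with it.
        t_a = grp_a["time"][acc_a] * (1 + ALPHA * grp_b["mem"])
        t_b = grp_b["time"][acc_b] * (1 + ALPHA * grp_a["mem"])
        total += max(t_a, t_b)
        # Charge one transition penalty if either DNN changed accelerator.
        if step > 0 and (acc_a != assign_a[step - 1] or acc_b != assign_b[step - 1]):
            total += TRANSITION_MS
    return total


def best_schedule():
    """Enumerate every layer-group-to-accelerator mapping and keep the cheapest."""
    n = len(PROFILE["dnn_a"])
    best = None
    for assign_a in product(ACCELERATORS, repeat=n):
        for assign_b in product(ACCELERATORS, repeat=n):
            cost = schedule_cost(assign_a, assign_b)
            if cost is not None and (best is None or cost < best[0]):
                best = (cost, assign_a, assign_b)
    return best


if __name__ == "__main__":
    cost, mapping_a, mapping_b = best_schedule()
    print(f"dnn_a -> {mapping_a}, dnn_b -> {mapping_b}, makespan ~ {cost:.2f} ms")
```

With these invented numbers, the cheapest schedule switches both DNNs between accelerators after the first group even though it pays the transition penalty, which is the kind of trade-off a contention-aware scheduler has to weigh. Exhaustive search grows exponentially with the number of layer groups, so it only serves to make the objective concrete.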

Paper Structure (36 sections, 11 equations, 7 figures, 8 tables)

Introduction
Related Work
HaX-CoNN: Heterogeneity-aware Execution of Concurrent Deep Neural Networks
Layer grouping
Per-layer performance and transition characterization
Layer characterization
Inter-DSA layer transitions
Characterizing shared memory contention
Formulating the problem
Objective functions
Optimal and dynamic schedule generation
Experimental Setup
Computing platforms
Applications
Profiling
...and 21 more sections

Figures (7)

Figure 1: Different ways of executing VGG-19 and ResNet-101 DNNs in parallel on Xavier AGX.
Figure 2: Overview of HaX-CoNN.
Figure 3: EMC utilization by conv layers on GPU and DLA with varying input (i) and filter (f) sizes.
Figure 4: Illustration for a hypothetical execution of five layers from three DNNs running on three different accelerators. Colored regions indicate additional slowdowns each layer experiences for varying external memory pressure.
Figure 5: Throughput (FPS) comparison for Scenario 1: Multiple instances of the same DNN are run concurrently on NVIDIA AGX Orin.
...and 2 more figures
