Shared Memory-contention-aware Concurrent DNN Execution for Diversely Heterogeneous System-on-Chips
Ismet Dagli, Mehmet Belviranli
TL;DR
HaX-CoNN tackles the problem of concurrently executing multiple DNN inferences on heterogeneous SoCs whose domain-specific accelerators (DSAs) share memory. It does so through layer-centric profiling and contention-aware scheduling across the DSAs. The approach has two parts: (i) a decoupled characterization of per-layer performance and inter-DSA transition costs, combined with a PCCS-based memory contention model, and (ii) a SAT-solver-based optimizer that yields optimal layer-to-DSA mappings under either a throughput or a latency objective. A dynamic variant, D-HaX-CoNN, adapts schedules at run time as workloads change. Evaluated on NVIDIA AGX Orin, NVIDIA Xavier, and Qualcomm Snapdragon 865, HaX-CoNN improves latency by up to 32% and throughput by up to 29% and reduces shared memory contention by up to 45%, which makes it relevant to edge and robotic systems that run multiple DNN tasks concurrently.
Abstract
Two distinguishing features of state-of-the-art mobile and autonomous systems are that 1) they often run multiple workloads, mainly deep neural network (DNN) inference, concurrently and continuously; and 2) they operate on shared-memory system-on-chips (SoCs) that embed heterogeneous accelerators tailored for specific operations. The state of the art lacks the efficient performance and resource management techniques necessary to either maximize total system throughput or minimize end-to-end workload latency. In this work, we propose HaX-CoNN, a novel scheme that characterizes and maps layers in concurrently executing DNN inference workloads to a diverse set of accelerators within an SoC. Our scheme uniquely takes per-layer execution characteristics, shared memory (SM) contention, and inter-accelerator transitions into account to find optimal schedules. We evaluate HaX-CoNN on NVIDIA Orin, NVIDIA Xavier, and Qualcomm Snapdragon 865 SoCs. Our experimental results indicate that HaX-CoNN reduces memory contention by up to 45% and can improve latency and total throughput by up to 32% and 29%, respectively, compared to state-of-the-art approaches.
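
To make the scheduling idea concrete, the following is a minimal sketch of choosing a layer-to-accelerator mapping for two concurrent DNNs. It is an illustration under stated assumptions and not the HaX-CoNN implementation. It replaces the SAT solver with exhaustive enumeration, which is only feasible for tiny inputs, and replaces the PCCS-based contention model with a crude multiplicative slowdown. All profile numbers, constants, and function names (`estimate_cost`, `best_schedule`) are hypothetical.

```python
from itertools import product

GPU, DLA = 0, 1  # accelerator indices (hypothetical two-accelerator SoC)

# Made-up profiles, one tuple per layer group:
# (time on GPU in ms, time on DLA in ms, memory intensity in [0, 1])
DNN_A = [(2.0, 3.0, 0.8), (1.0, 1.5, 0.2), (3.0, 3.5, 0.6)]
DNN_B = [(1.5, 2.5, 0.7), (2.5, 3.0, 0.3)]

TRANSITION_MS = 0.4      # assumed cost of switching accelerators mid-network
CONTENTION_ALPHA = 0.5   # assumed strength of the shared-memory slowdown


def mean_intensity(layers):
    """Average memory intensity of a DNN's layer groups."""
    return sum(layer[2] for layer in layers) / len(layers)


def slowed_times(layers, mapping, other_intensity):
    """Per-layer times after applying a crude contention slowdown."""
    return [
        layer[acc] * (1.0 + CONTENTION_ALPHA * layer[2] * other_intensity)
        for layer, acc in zip(layers, mapping)
    ]


def estimate_cost(map_a, map_b):
    """Estimated completion time (ms) when both DNNs run concurrently."""
    times_a = slowed_times(DNN_A, map_a, mean_intensity(DNN_B))
    times_b = slowed_times(DNN_B, map_b, mean_intensity(DNN_A))
    load = [0.0, 0.0]  # busy time per accelerator
    latencies = []
    for times, mapping in ((times_a, map_a), (times_b, map_b)):
        switches = sum(p != q for p, q in zip(mapping, mapping[1:]))
        latencies.append(sum(times) + switches * TRANSITION_MS)
        for t, acc in zip(times, mapping):
            load[acc] += t
    # Lower bound: no DNN finishes before its own latency, and an
    # accelerator cannot finish before all work assigned to it is done.
    return max(latencies + load)


def best_schedule():
    """Try every layer-to-accelerator mapping and keep the cheapest."""
    candidates = (
        (estimate_cost(a, b), a, b)
        for a in product((GPU, DLA), repeat=len(DNN_A))
        for b in product((GPU, DLA), repeat=len(DNN_B))
    )
    return min(candidates)


cost, map_a, map_b = best_schedule()
print(f"estimated cost {cost:.2f} ms, DNN A -> {map_a}, DNN B -> {map_b}")
```

Even in this toy model, placing both networks on the nominally faster accelerator is not the cheapest choice, because their work serializes there. This is the kind of trade-off that a contention-aware, per-layer mapping is meant to capture. A solver-based formulation would encode the same costs as constraints rather than enumerating every mapping.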
