Evaluating Multi-Instance DNN Inferencing on Multiple Accelerators of an Edge Device
Mumuksh Tayal, Yogesh Simmhan
TL;DR
The paper addresses the challenge of maximizing real-time DNN inference performance on edge devices by evaluating concurrent execution across the heterogeneous accelerators of a Jetson AGX Orin. It systematically benchmarks multiple ResNet50 instances with varying batch sizes across CUDA Cores, Tensor Cores, and DLA using PyTorch and TensorRT, exposing the throughput and latency trade-offs that arise as instances contend for shared resources. A key finding is that combining CUDA Cores with Tensor Cores (AMP), or pairing CUDA Cores with DLA, can improve throughput at specific batch sizes, whereas using all three accelerators together degrades performance because of resource contention. Accuracy remains around $76\%$ across configurations. The work highlights the need for intelligent scheduling and workload allocation to optimize resource utilization for heterogeneous DNN inference on edge devices.
Abstract
Edge devices like Nvidia Jetson platforms now offer several on-board accelerators -- including GPU CUDA cores, Tensor Cores, and Deep Learning Accelerators (DLA) -- which can be concurrently exploited to boost deep neural network (DNN) inferencing. In this paper, we extend previous work by evaluating the performance impacts of running multiple instances of the ResNet50 model concurrently across these heterogeneous components. We detail the effects of varying batch sizes and hardware combinations on throughput and latency. Our expanded analysis highlights not only the benefits of combining CUDA and Tensor Cores, but also the performance degradation from resource contention when integrating DLAs. These findings, together with insights on precision constraints and workload allocation challenges, motivate further exploration of intelligent scheduling mechanisms to optimize resource utilization on edge platforms.
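To make the two measured quantities concrete, the following is a minimal sketch of how aggregate throughput and per-batch latency could be collected when several model instances run concurrently. It is not the authors' harness: the names `summarize`, `run_instance`, `benchmark`, and `fake_infer` are hypothetical, the use of one Python thread per instance is an assumption, and `fake_infer` is a stand-in that sleeps instead of invoking a real ResNet50 on a CUDA, Tensor Core, or DLA backend.

```python
import threading
import time
from statistics import median


def summarize(batch_size, batch_latencies_s, wall_time_s):
    """Return (throughput in images/s, median per-batch latency in ms)."""
    images = batch_size * len(batch_latencies_s)
    throughput = images / wall_time_s
    latency_ms = median(batch_latencies_s) * 1000.0
    return throughput, latency_ms


def run_instance(infer_fn, batch_size, num_batches, out, idx):
    """Run one model instance and record the duration of every batch."""
    latencies = []
    for _ in range(num_batches):
        start = time.perf_counter()
        infer_fn(batch_size)
        latencies.append(time.perf_counter() - start)
    out[idx] = latencies


def benchmark(infer_fns, batch_size, num_batches):
    """Run one thread per instance concurrently.

    Returns the aggregate throughput (images/s summed over instances)
    and a list of (throughput, median latency in ms) per instance.
    """
    results = [None] * len(infer_fns)
    threads = [
        threading.Thread(
            target=run_instance,
            args=(fn, batch_size, num_batches, results, i),
        )
        for i, fn in enumerate(infer_fns)
    ]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    wall = time.perf_counter() - start
    per_instance = [summarize(batch_size, lat, wall) for lat in results]
    total_throughput = sum(tp for tp, _ in per_instance)
    return total_throughput, per_instance


def fake_infer(batch_size):
    """Hypothetical stand-in for a model call: sleeps 1 ms per image."""
    time.sleep(0.001 * batch_size)


if __name__ == "__main__":
    total, per_instance = benchmark([fake_infer, fake_infer], 8, 20)
    print(f"aggregate throughput: {total:.1f} images/s")
    for i, (tp, lat) in enumerate(per_instance):
        print(f"instance {i}: {tp:.1f} images/s, median latency {lat:.2f} ms")
```

In a real experiment each entry of `infer_fns` would wrap a model bound to a specific accelerator, and the instances would more likely be isolated in separate processes or CUDA streams. The sketch only illustrates the bookkeeping behind the throughput and latency figures.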
