Optimizing Multi-DNN Inference on Mobile Devices through Heterogeneous Processor Co-Execution
Yunquan Gao, Zhiguo Zhang, Praveen Kumar Donta, Chinmaya Kumar Dehury, Xiujun Wang, Dusit Niyato, Qiyang Zhang
TL;DR
ADMS tackles the challenge of running multiple DNNs concurrently on mobile devices with heterogeneous processors by combining offline subgraph partitioning with online, processor-state-aware scheduling. It introduces three components: a Model Analyzer that creates hardware-friendly subgraphs using a window size parameter, a Hardware Monitor that reports real-time device status, and a Scheduler that assigns tasks using a multi-factor priority model. Experiments on the Redmi K50 Pro and Huawei P20 show up to 4.04x lower latency than TFLite and a 24.2% improvement in energy efficiency over Band, along with better thermal stability and robustness under stress. The work demonstrates that fine-grained subgraph scheduling coupled with dynamic, hardware-aware coordination can yield substantial performance gains for real-world multi-DNN mobile workloads.
Abstract
Deep Neural Networks (DNNs) are increasingly deployed across diverse industries, driving demand for mobile device support. However, existing mobile inference frameworks often rely on a single processor per model, which limits hardware utilization and leads to suboptimal performance and energy efficiency. Expanding DNN accessibility on mobile platforms requires adaptive, resource-efficient solutions that meet rising computational needs without compromising functionality. Parallel inference of multiple DNNs on heterogeneous processors remains challenging. Some prior works partition DNN operations into subgraphs for parallel execution across processors, but they often create an excessive number of subgraphs because they split on hardware compatibility alone, which increases scheduling complexity and memory overhead. To address this, we propose an Advanced Multi-DNN Model Scheduling (ADMS) strategy for optimizing multi-DNN inference on mobile heterogeneous processors. ADMS constructs an optimal subgraph partitioning strategy offline, balancing hardware operation support against scheduling granularity, and uses a processor-state-aware algorithm to dynamically adjust workloads based on real-time conditions. This ensures efficient workload distribution and maximizes processor utilization. Experiments show that ADMS reduces multi-DNN inference latency by up to 4.04x compared to vanilla frameworks such as TFLite.
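To make the two ideas in the abstract concrete, the following is a minimal Python sketch, not the ADMS implementation. It rests on several assumptions that do not come from the paper: a model is treated as a linear sequence of operator names, the window size is read as the minimum length an accelerator-supported run must have to become its own subgraph (shorter runs are folded into CPU fallback), and the scheduler is a simple earliest-finish-time heuristic standing in for the multi-factor priority model. The names `Processor`, `partition`, `pick_processor`, `ms_per_op`, and `throttle_factor` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Processor:
    """Hypothetical processor state, as a hardware monitor might report it."""
    name: str
    supported_ops: frozenset
    ms_per_op: float               # assumed uniform cost per operator
    busy_until_ms: float = 0.0     # when already-queued work finishes
    throttle_factor: float = 1.0   # > 1.0 models thermal slowdown


def partition(ops, accel_ops, window):
    """Split a linear operator sequence into (target, ops) subgraphs.

    Assumption: accelerator-supported runs shorter than `window` are not
    worth a separate subgraph and are merged into the CPU fallback run.
    """
    # Step 1: naive split at every change in accelerator support.
    runs = []
    for op in ops:
        target = "accel" if op in accel_ops else "cpu"
        if runs and runs[-1][0] == target:
            runs[-1][1].append(op)
        else:
            runs.append([target, [op]])

    # Step 2: demote short accelerator runs and merge equal neighbours.
    merged = []
    for target, run in runs:
        if target == "accel" and len(run) < window:
            target = "cpu"
        if merged and merged[-1][0] == target:
            merged[-1][1].extend(run)
        else:
            merged.append([target, list(run)])
    return [(target, run) for target, run in merged]


def pick_processor(subgraph_ops, processors, now_ms):
    """Assign a subgraph to the processor with the earliest estimated finish.

    Stand-in heuristic: only operator support, queue backlog, and a
    throttling multiplier are considered.
    """
    best, best_finish = None, float("inf")
    for p in processors:
        if not set(subgraph_ops) <= p.supported_ops:
            continue  # this processor cannot run every operator
        start = max(now_ms, p.busy_until_ms)
        finish = start + len(subgraph_ops) * p.ms_per_op * p.throttle_factor
        if finish < best_finish:
            best, best_finish = p, finish
    if best is None:
        raise ValueError("no processor supports this subgraph")
    best.busy_until_ms = best_finish  # record the new backlog
    return best.name, best_finish
```

In this sketch, a larger window yields fewer, coarser subgraphs at the cost of running some accelerator-capable operators on the CPU, and a throttled accelerator can lose an assignment to the CPU even though it is nominally faster. Both behaviours illustrate the trade-offs the abstract describes, under the assumptions stated above.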
