Deep Learning Inference on Heterogeneous Mobile Processors: Potentials and Pitfalls

Sicong Liu, Wentao Zhou, Zimu Zhou, Bin Guo, Minfan Wang, Cheng Fang, Zheng Lin, Zhiwen Yu

TL;DR

The paper investigates the practical potential and pitfalls of DL inference across heterogeneous mobile processors. It presents a two-level optimization framework (frontend DAG-based pruning and backend operator distribution/memory allocation) and conducts an extensive empirical study across eight devices, multiple DL models, and dynamic workloads. Key findings show that no single strategy fits all contexts: unsupported operators frequently force fallbacks, and substantial gains are achievable only through cross-level optimization and adaptive scheduling. The work highlights the importance of runtime latency profiling and data-reuse-aware backend design for realizing real-world performance improvements in mobile environments.

Abstract

There is a growing demand to deploy computation-intensive deep learning (DL) models on resource-constrained mobile devices for real-time intelligent applications. Equipped with a variety of processing units such as CPUs, GPUs, and NPUs, mobile devices hold the potential to accelerate DL inference via parallel execution across heterogeneous processors. Various efficient parallel methods have been explored to optimize computation distribution, achieve load balance, and minimize communication cost across processors. Yet their practical effectiveness in the dynamic and diverse real-world mobile environment remains underexplored. This paper presents a holistic empirical study to assess the capabilities and challenges associated with parallel DL inference on heterogeneous mobile processors. Through carefully designed experiments covering various DL models, mobile software/hardware environments, workload patterns, and resource availability, we identify limitations of existing techniques and highlight opportunities for cross-level optimization.
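To make the idea of distributing operators across heterogeneous processors concrete, the following is a minimal sketch, not the paper's framework. It assumes a hypothetical per-operator latency profile (the operator names, processor names, and all latency values are invented for illustration) and greedily assigns each operator to the fastest processor that supports it, while recording which processors lack support for which operators.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical latency profile: operator -> processor -> latency in ms.
# None means the processor cannot run that operator. All numbers are made up.
Profile = Dict[str, Dict[str, Optional[float]]]


def place_operators(profile: Profile) -> Tuple[Dict[str, str], Dict[str, List[str]]]:
    """Assign each operator to the fastest processor that supports it.

    Returns the placement and, for each operator that some processor cannot
    run, the list of processors lacking support.
    """
    placement: Dict[str, str] = {}
    unsupported: Dict[str, List[str]] = {}
    for op, latencies in profile.items():
        # Keep only processors that can actually execute this operator.
        supported = {p: t for p, t in latencies.items() if t is not None}
        if not supported:
            raise ValueError(f"no processor supports operator {op!r}")
        placement[op] = min(supported, key=supported.get)
        missing = [p for p, t in latencies.items() if t is None]
        if missing:
            unsupported[op] = missing
    return placement, unsupported


# Illustrative profile only; not measured data.
profile: Profile = {
    "conv2d": {"cpu": 8.0, "gpu": 3.0, "npu": 1.5},
    "softmax": {"cpu": 0.4, "gpu": 0.6, "npu": None},
    "custom_nms": {"cpu": 2.0, "gpu": None, "npu": None},
}

placement, unsupported = place_operators(profile)
print(placement)
print(unsupported)
```

This greedy placement deliberately ignores inter-processor communication cost, load balance, and runtime contention, which are exactly the factors the abstract says parallel methods must account for. A static choice like this is a baseline to compare against rather than a realistic scheduler.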

Paper Structure (13 sections, 5 figures, 4 tables)

Figures (5)

  • Figure 1: Illustration of parallel DL inference workflow across heterogeneous processors on mobile devices.
  • Figure 2: Latency on different mobile devices with diverse parallel inference strategies.
  • Figure 3: Operator support over diverse models.
  • Figure 4: Impact of parallel scheduling granularity.
  • Figure 5: GPU utilization and mobile user interaction responsiveness (e.g., frame drop rate) at runtime.