Dynamic DNNs and Runtime Management for Efficient Inference on Mobile/Embedded Devices

Lei Xun, Jonathon Hare, Geoff V. Merrett

TL;DR

This work tackles the challenge of deploying DNN inference on mobile and embedded devices when hardware availability and application requirements change at runtime. It introduces Dynamic Super-Networks (Dynamic OFA), which sample sub-networks from backbone super-networks for each heterogeneous core without retraining, and couples them with a hierarchical runtime resource manager that jointly tunes sub-network selection and DVFS. The key contributions are a system-level framework for managing performance trade-offs, the Dynamic OFA approach, which is 2.4x faster at similar ImageNet Top-1 accuracy or 5.1% more accurate at similar latency than the state of the art with minimal memory and switching costs, and a runtime manager that reduces energy by up to 19% and latency by up to 9% in a single-model scenario, and by 89% and 23% respectively when two models run concurrently. Together, these ideas enable more efficient, adaptive on-device inference for diverse workloads on heterogeneous SoCs, which matters for latency-sensitive and energy-constrained applications.

Abstract

Deep neural network (DNN) inference is increasingly executed on mobile and embedded platforms because of key advantages in latency, privacy and always-on availability. However, limited computing resources make efficient DNN deployment on these platforms challenging. Although previous works have proposed many hardware accelerators and static model compression methods, at system runtime multiple applications typically execute concurrently and compete for hardware resources. This raises two main challenges: Runtime Hardware Availability and Runtime Application Variability. Previous works have addressed these challenges through either dynamic neural networks, which contain sub-networks with different performance trade-offs, or runtime hardware resource management. In this thesis, we propose a combined method: a system for DNN performance trade-off management that combines the runtime trade-off opportunities in both algorithms and hardware to meet dynamically changing application performance targets and hardware constraints in real time. We co-designed novel Dynamic Super-Networks to maximise runtime system-level performance and energy efficiency on heterogeneous hardware platforms. Compared with the state of the art (SOTA), our experimental results using ImageNet on the GPU of the Jetson Xavier NX show that our model is 2.4x faster at similar ImageNet Top-1 accuracy, or achieves 5.1% higher accuracy at similar latency. We also designed a hierarchical runtime resource manager that tunes both dynamic neural networks and DVFS at runtime. Compared with the Linux DVFS governor schedutil, our runtime approach achieves up to a 19% energy reduction and a 9% latency reduction in a single-model deployment scenario, and an 89% energy reduction and a 23% latency reduction in a scenario with two concurrently deployed models.
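
To make the idea of jointly tuning sub-network selection and DVFS concrete, here is a minimal sketch in Python. It is not the hierarchical manager described in the thesis: it assumes a flat, pre-profiled table of operating points and picks the lowest-energy one that meets a latency target and an accuracy floor. The names `OperatingPoint` and `select_operating_point`, the sub-network labels, and every number in the table are hypothetical placeholders, not measured results.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class OperatingPoint:
    """One (sub-network, DVFS level) pair with profiled metrics (all hypothetical)."""
    subnet: str        # hypothetical sub-network label
    freq_mhz: int      # hypothetical DVFS frequency level
    latency_ms: float  # placeholder latency per inference
    energy_mj: float   # placeholder energy per inference
    top1: float        # placeholder Top-1 accuracy in percent


def select_operating_point(
    points: Sequence[OperatingPoint],
    latency_target_ms: float,
    min_top1: float,
) -> Optional[OperatingPoint]:
    """Return the lowest-energy point meeting both constraints, or None."""
    feasible = [
        p for p in points
        if p.latency_ms <= latency_target_ms and p.top1 >= min_top1
    ]
    if not feasible:
        return None
    # Prefer lower energy; break ties in favour of higher accuracy.
    return min(feasible, key=lambda p: (p.energy_mj, -p.top1))


# Placeholder profile table; values are illustrative only.
POINTS = [
    OperatingPoint("subnet_small", 600, 12.0, 40.0, 70.0),
    OperatingPoint("subnet_small", 1100, 7.0, 55.0, 70.0),
    OperatingPoint("subnet_large", 600, 30.0, 95.0, 78.0),
    OperatingPoint("subnet_large", 1100, 18.0, 120.0, 78.0),
]

if __name__ == "__main__":
    # A tighter latency target forces a higher frequency or a smaller sub-network.
    print(select_operating_point(POINTS, latency_target_ms=20.0, min_top1=75.0))
    print(select_operating_point(POINTS, latency_target_ms=15.0, min_top1=65.0))
```

A real manager would re-run such a selection whenever the application target or the available hardware changes, and would likely rely on runtime monitors rather than a static table.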

Paper Structure (4 sections, 3 figures)

This paper contains 4 sections, 3 figures.

Figures (3)

  • Figure 1: The high-level diagram of the proposed runtime system, which contains three abstract layers that are connected through knobs and monitors.
  • Figure 2: Dynamic Super-network. It samples and combines different efficient sub-network libraries from backbone super-networks for all heterogeneous cores to build dynamic neural networks without training.
  • Figure 3: The performance trade-offs between ImageNet Top-1 accuracy and latency of our Dynamic OFA model [dynamic-ofa] on the GPU of the Jetson Xavier NX platform, compared against the SOTA static OFA backbone model [cai2019once] and dynamic DNN models [yu2018slimmable, yu2019universally, yu2019autoslim, yang2021mutualnet]. The Dynamic OFA model is 2.4x faster (at similar accuracy) or has 5.1% higher Top-1 ImageNet accuracy (at similar latency) than AutoSlim-MnasNet [yu2019autoslim].
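
The accuracy and latency trade-off shown in Figure 3 suggests one simple way to decide which sub-networks are worth keeping in a library: retain only those that no other candidate beats on both metrics. The sketch below shows that Pareto filter in Python. It is an assumption made for illustration, not the sampling procedure used in the paper; the function name `pareto_front`, the candidate labels, and all numbers are hypothetical.

```python
from typing import List, Tuple

# (label, latency in ms, Top-1 accuracy in percent); all values are placeholders.
Candidate = Tuple[str, float, float]


def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Keep candidates that are not dominated on (lower latency, higher accuracy)."""
    front: List[Candidate] = []
    best_acc = float("-inf")
    # Visit candidates from fastest to slowest; at equal latency, most accurate first.
    for cand in sorted(candidates, key=lambda c: (c[1], -c[2])):
        _, _, acc = cand
        # A slower candidate is only useful if it is strictly more accurate
        # than every faster candidate already kept.
        if acc > best_acc:
            front.append(cand)
            best_acc = acc
    return front


CANDIDATES: List[Candidate] = [
    ("a", 5.0, 70.0),
    ("b", 6.0, 69.0),   # slower and less accurate than "a", so dominated
    ("c", 8.0, 75.0),
    ("d", 8.0, 74.0),   # same latency as "c" but less accurate, so dominated
    ("e", 12.0, 78.0),
]

if __name__ == "__main__":
    for label, latency, accuracy in pareto_front(CANDIDATES):
        print(f"{label}: {latency} ms, {accuracy}% Top-1")
```

Because latency depends on the target core, such a filter would need to be run once per core with that core's profiled latencies.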