Mobiprox: Supporting Dynamic Approximate Computing on Mobiles

Matevž Fabjančič, Octavian Machidon, Hashim Sharif, Yifan Zhao, Saša Misailović, Veljko Pejović

TL;DR

Mobiprox addresses the need for context-aware, runtime-adaptive compression of mobile deep learning models to cope with changing resource and input conditions. It presents an end-to-end pipeline that identifies Pareto-optimal approximation configurations offline, profiles them on Android devices via a custom OpenCL tensor runtime, and enables dynamic adaptation through state- and confidence-based strategies. Empirical results in the human activity recognition (HAR) and spoken keyword recognition (SKR) domains show up to 15% system-wide energy savings with minimal or no loss in accuracy, while the speedups observed during server-side tuning can be higher than those achieved on the device. The work demonstrates the practicality of on-device adaptive approximate computing and highlights pathways for integration with production mobile DL stacks to further improve efficiency.
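The idea of a Pareto-optimal approximation configuration can be illustrated with a minimal sketch. It assumes each configuration is summarised by two numbers, a QoS (accuracy) loss and a speedup; the `Config` type, the `pareto_front` helper, and all values are hypothetical and are not taken from Mobiprox or its tuner.

```python
from typing import List, NamedTuple


class Config(NamedTuple):
    """Hypothetical summary of one approximation configuration."""
    name: str
    qos_loss: float  # accuracy drop in percentage points (lower is better)
    speedup: float   # relative to the unapproximated model (higher is better)


def pareto_front(configs: List[Config]) -> List[Config]:
    """Return the configurations that no other configuration dominates.

    A configuration is dominated if another one is at least as good on
    both criteria and strictly better on at least one of them.
    """
    front = []
    for c in configs:
        dominated = any(
            o.qos_loss <= c.qos_loss
            and o.speedup >= c.speedup
            and (o.qos_loss < c.qos_loss or o.speedup > c.speedup)
            for o in configs
        )
        if not dominated:
            front.append(c)
    return sorted(front, key=lambda c: c.qos_loss)


# Illustrative values only, not measurements from the paper.
candidates = [
    Config("exact", 0.0, 1.00),
    Config("a", 0.5, 1.30),
    Config("b", 0.8, 1.20),  # dominated by "a": more loss, less speedup
    Config("c", 2.0, 1.80),
]
front = pareto_front(candidates)
```

Only the configurations on the front are worth profiling further, since every other candidate loses on both criteria to some alternative.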

Abstract

Runtime-tunable context-dependent network compression would make mobile deep learning (DL) adaptable to often varying resource availability, input "difficulty", or user needs. The existing compression techniques significantly reduce the memory, processing, and energy tax of DL, yet the resulting models tend to be permanently impaired, sacrificing inference power for reduced resource usage. The existing tunable compression approaches, on the other hand, require expensive re-training, do not support arbitrary strategies for adapting the compression, and do not provide mobile-ready implementations. In this paper we present Mobiprox, a framework enabling mobile DL with flexible precision. Mobiprox implements tunable approximations of tensor operations and enables runtime-adaptable approximation of individual network layers. A profiler and a tuner included with Mobiprox identify the most promising neural network approximation configurations, leading to the desired inference quality with minimal use of resources. Furthermore, we develop control strategies that, depending on contextual factors such as input data difficulty, dynamically adjust the approximation levels across a mobile DL model's layers. We implement Mobiprox on Android OS and, through experiments in diverse mobile domains, including human activity recognition and spoken keyword detection, demonstrate that it can save up to 15% system-wide energy with a minimal impact on the inference accuracy.
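A control strategy driven by input difficulty could look roughly like the sketch below. It assumes that the softmax confidence of the previous inference is used as a proxy for difficulty, and that configurations are indexed from 0 (no approximation) upward. The `ConfidenceController` class and its thresholds are hypothetical illustrations, not the strategies implemented in Mobiprox.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


class ConfidenceController:
    """Hypothetical controller: approximate more when the model is confident,
    back off towards the exact model when it is not."""

    def __init__(self, num_levels: int,
                 raise_above: float = 0.9, lower_below: float = 0.6):
        self.num_levels = num_levels    # number of available configurations
        self.raise_above = raise_above  # assumed threshold, not from the paper
        self.lower_below = lower_below  # assumed threshold, not from the paper
        self.level = 0                  # 0 = no approximation

    def update(self, logits: List[float]) -> int:
        """Pick the approximation level for the next inference."""
        confidence = max(softmax(logits))
        if confidence >= self.raise_above:
            self.level = min(self.level + 1, self.num_levels - 1)
        elif confidence < self.lower_below:
            self.level = max(self.level - 1, 0)
        return self.level


ctrl = ConfidenceController(num_levels=3)
outputs = [[8.0, 0.0, 0.0], [8.0, 0.0, 0.0], [8.0, 0.0, 0.0], [0.2, 0.1, 0.0]]
levels = [ctrl.update(o) for o in outputs]
```

With these inputs the controller steps up while predictions are confident, saturates at the most aggressive level, and steps back down once an ambiguous output appears.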

Paper Structure (24 sections, 1 equation, 8 figures, 5 tables, 2 algorithms)

Figures (8)

  • Figure 1: Perforated convolution. Coloured sections indicate convolution coordinates. Dashed squares indicate the area of the first and the final convolution.
  • Figure 2: Mobiprox overview. OpenCL run-time supports running the inference binary (controlled either directly from the C code, or via JNI from the main Java/Kotlin app) with a varying level of approximation. The HPVM Profiler for Android helps us chart the approximation -- resource usage space, so that the Approximation adaptation strategy within the Android app can set the approximation level dynamically at runtime. Main Mobiprox modules are colored green, while the supporting pre-existing modules are grayed out.
  • Figure 3: Comparison of the achieved speedup and the resulting QoS (inference accuracy) loss for approximation configurations selected by the on-server tuning with the same configurations run on a mobile platform. Note the different scaling of the y-axis.
  • Figure 4: System-wide energy consumption (relative to no approximation) of an ASUS TinkerBoard S running inference on NNs trained on different datasets. Different point types correspond to different NN architectures; each point represents a single approx. configuration. The x-axis represents the actual QoS loss from the model deployed on a mobile device.
  • Figure 5: Relative energy consumption compared to relative inference time reduction for mobilenet_uci-har at various approximation configurations. The x-axis shows the actual QoS loss from the model deployed on a mobile device.
  • ...and 3 more figures