Hardware optimization on Android for inference of AI models

Iulius Gherasim, Carlos García Sánchez

TL;DR

This work tackles the challenge of optimizing on-device Android inference for AI models by benchmarking ResNet and YOLO across a heterogeneous SoC (CPU, GPU, NPU) on a Galaxy Tab S9 using multiple quantization schemes (FP32, FP16, INT8, DYN). The authors implement a mixed software pipeline (PyTorch→ONNX→onnx2tf→TFLite via LiteRT) and evaluate latency and accuracy on ImageNet and COCO datasets across five execution configurations, supplemented by a Pareto-front analysis to identify optimal trade-offs. Key findings show the NPU delivers substantial speedups (up to ~298x for large YOLO variants) and that INT8 quantization on ResNet yields the best speed-accuracy balance, while YOLO models experience more accuracy loss with INT8 and are less sensitive to FP16 gains on GPU. Dynamic quantization generally preserves accuracy but is hampered by NPU compatibility, and conversion steps introduce non-negligible accuracy loss. These results provide practical guidance for deploying edge vision on Android, highlighting accelerator selection and quantization strategies, while outlining future work on power-aware optimization and broader model coverage.

Abstract

Artificial Intelligence models are now integrated into mobile computing across numerous use cases, from virtual assistants to advanced image processing. A good mobile user experience requires minimal latency and high responsiveness from deployed AI models, and the challenges range from execution strategies that meet real-time constraints to the exploitation of heterogeneous hardware architectures. In this paper, we investigate and propose the optimal execution configurations for AI models on an Android system, focusing on two critical tasks: object detection (YOLO family) and image classification (ResNet). These configurations cover various model quantization schemes and the use of on-device accelerators, specifically the GPU and NPU. Our core objective is to empirically determine the combination that achieves the best trade-off between minimal accuracy degradation and maximal inference speed-up.
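The trade-off between accuracy degradation and inference speed-up can be framed as a Pareto front over the execution configurations: a configuration is kept only if no other configuration is at least as fast and at least as accurate while being strictly better on one of the two. The following is a minimal sketch of that selection, not the authors' implementation. The `Config` type, the `pareto_front` helper, the configuration names, and all numeric values are hypothetical placeholders and are not measurements from the paper.

```python
from typing import List, NamedTuple


class Config(NamedTuple):
    """One execution configuration (device + quantization). Hypothetical type."""
    name: str
    latency_ms: float  # lower is better
    accuracy: float    # higher is better


def pareto_front(configs: List[Config]) -> List[Config]:
    """Return the configurations that no other configuration dominates."""
    front = []
    for c in configs:
        dominated = any(
            # o is no worse than c on both axes ...
            o.latency_ms <= c.latency_ms and o.accuracy >= c.accuracy
            # ... and strictly better on at least one of them
            and (o.latency_ms < c.latency_ms or o.accuracy > c.accuracy)
            for o in configs
        )
        if not dominated:
            front.append(c)
    # Order the front from fastest to slowest for readability
    return sorted(front, key=lambda c: c.latency_ms)


# Placeholder values for illustration only, NOT results from the paper
candidates = [
    Config("cpu-fp32", 100.0, 0.80),
    Config("gpu-fp16", 40.0, 0.80),
    Config("npu-int8", 5.0, 0.78),
    Config("cpu-int8", 60.0, 0.77),
]

front = pareto_front(candidates)
print([c.name for c in front])
```

With these placeholder numbers, `cpu-fp32` and `cpu-int8` drop out because `gpu-fp16` is both faster and at least as accurate, leaving `npu-int8` and `gpu-fp16` as the two non-dominated choices. The quadratic scan is adequate for the handful of device and quantization combinations a study like this compares.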

Paper Structure

This paper contains 11 sections, 6 figures, 16 tables.

Figures (6)

  • Figure 1: Experimental Workflow and Model Preparation Pipeline
  • Figure 3: YOLOv8n inference speeds across devices & quantizations
  • Figure 4: YOLO11n inference speeds across devices & quantizations
  • Figure 5: YOLO11 vs YOLOv8 accuracy comparison
  • Figure 6: Pareto Front ResNet
  • ...and 1 more figure