A Study on Inference Latency for Vision Transformers on Mobile Devices
Zhuojin Li, Marco Paolieri, Leana Golubchik
TL;DR
This work analyzes the inference latency of Vision Transformers (ViTs) on mobile devices, comparing them to CNNs and building a large-scale latency dataset of 190 real-world and 1000 synthetic ViTs across 6 devices and 2 frameworks. It finds that ViTs generally incur higher latency than CNNs at comparable FLOPs, are more memory-bound, and have larger memory footprints. It also finds that GELU activation latency varies with input values and that the choice of memory format significantly affects performance. To enable practical deployment, the authors train latency predictors (Lasso, random forests, and especially gradient-boosted decision trees, GBDT) on synthetic ViTs and demonstrate accurate end-to-end latency estimation for both synthetic and real-world ViTs, supporting tasks such as Neural Architecture Search and collaborative (split) inference. The resulting dataset and predictors offer a practical path to fast on-device ViT design and deployment; future work targets additional accelerators and the effects of background tasks.
Abstract
Given the significant advances in machine learning techniques on mobile devices, particularly in the domain of computer vision, in this work we quantitatively study the performance characteristics of 190 real-world vision transformers (ViTs) on mobile devices. Through a comparison with 102 real-world convolutional neural networks (CNNs), we provide insights into the factors that influence the latency of ViT architectures on mobile devices. Based on these insights, we develop a dataset including measured latencies of 1000 synthetic ViTs with representative building blocks and state-of-the-art architectures from two machine learning frameworks and six mobile platforms. Using this dataset, we show that inference latency of new ViTs can be predicted with sufficient accuracy for real-world applications.
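To make the idea of a learned latency predictor concrete, the following is a minimal sketch of gradient boosting with depth-1 regression trees (stumps), written in plain Python. It is not the authors' implementation: the two input features (GFLOPs and memory traffic), the formula that generates the latencies, and all function names are assumptions chosen only for illustration, and no measured data is used.

```python
import random

random.seed(0)

# Hypothetical training set: each row is (GFLOPs, memory traffic in MB) for one
# synthetic model. Latencies come from a made-up formula plus noise, NOT from
# measurements on any device.
X = [(random.uniform(0.5, 5.0), random.uniform(5.0, 50.0)) for _ in range(40)]
y = [8.0 * g + 0.5 * m + random.gauss(0.0, 1.0) for g, m in X]


def fit_stump(X, residuals):
    """Return the (feature, threshold, left_mean, right_mean) split with the
    lowest squared error on the residuals."""
    best = None
    for f in range(len(X[0])):
        values = sorted(set(row[f] for row in X))
        # Candidate thresholds are midpoints between consecutive distinct values,
        # so both sides of every split are non-empty.
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2
            left = [r for row, r in zip(X, residuals) if row[f] <= thr]
            right = [r for row, r in zip(X, residuals) if row[f] > thr]
            lm = sum(left) / len(left)
            rm = sum(right) / len(right)
            sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
            if best is None or sse < best[0]:
                best = (sse, f, thr, lm, rm)
    return best[1:]


def fit_gbdt(X, y, n_rounds=100, lr=0.1):
    """Gradient boosting for squared loss: each stump fits the current residuals."""
    base = sum(y) / len(y)  # initial prediction is the mean latency
    pred = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        residuals = [t - p for t, p in zip(y, pred)]
        f, thr, lv, rv = fit_stump(X, residuals)
        stumps.append((f, thr, lv, rv))
        pred = [p + lr * (lv if row[f] <= thr else rv) for p, row in zip(pred, X)]
    return base, lr, stumps


def predict(model, row):
    """Predicted latency: mean plus the shrunken contribution of every stump."""
    base, lr, stumps = model
    return base + sum(lr * (lv if row[f] <= thr else rv) for f, thr, lv, rv in stumps)


def mse(preds, targets):
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(targets)


model = fit_gbdt(X, y)
baseline_error = mse([model[0]] * len(y), y)
model_error = mse([predict(model, row) for row in X], y)
print(f"training MSE: mean baseline {baseline_error:.2f}, boosted stumps {model_error:.2f}")
```

The comparison above uses training error only, which checks that the fit works but says nothing about generalization. Judging a predictor the way the abstract describes would require held-out architectures and latencies measured on real devices.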
