A Study on Inference Latency for Vision Transformers on Mobile Devices

Zhuojin Li, Marco Paolieri, Leana Golubchik

TL;DR

This work analyzes the inference latency of Vision Transformers (ViTs) on mobile devices. It compares 190 real-world ViTs with CNNs and builds a latency dataset that also covers 1000 synthetic ViTs, measured across 6 devices and 2 frameworks. The authors find that, at comparable FLOPs, ViTs generally incur higher latency than CNNs, are more memory-bound, and have larger memory footprints. They also find that GELU activation latency varies with input values and that the choice of memory format significantly alters performance. To enable practical deployment, the authors train latency predictors (Lasso, random forests (RF), and especially gradient-boosted decision trees (GBDT)) on synthetic ViTs and demonstrate accurate end-to-end latency estimation for both synthetic and real-world ViTs, supporting tasks such as Neural Architecture Search and collaborative (split) inference. The resulting dataset and predictors offer a practical path to fast on-device ViT design and deployment; future work targets additional accelerators and the effects of background tasks.
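The "memory-bound" observation can be illustrated with arithmetic intensity, the ratio of floating-point operations to bytes moved to and from memory. The sketch below is a minimal roofline-style calculation, not the authors' measurement method. All function names, the operation-cost formulas (which ignore caching and operator fusion), and the device numbers are assumptions chosen for illustration.

```python
def matmul_cost(m: int, k: int, n: int, bytes_per_elem: int = 4):
    """FLOPs and bytes for an (m x k) @ (k x n) product, assuming each
    operand is read once and the result is written once (no cache reuse)."""
    flops = 2 * m * k * n  # one multiply and one add per inner-product term
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops, bytes_moved


def elementwise_cost(num_elems: int, flops_per_elem: int = 1,
                     bytes_per_elem: int = 4):
    """FLOPs and bytes for an element-wise op such as an activation.
    flops_per_elem is a hypothetical placeholder, not a measured value."""
    flops = flops_per_elem * num_elems
    bytes_moved = bytes_per_elem * 2 * num_elems  # read input, write output
    return flops, bytes_moved


def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    """FLOPs performed per byte of memory traffic."""
    return flops / bytes_moved


def is_memory_bound(intensity: float, peak_flops: float,
                    peak_bandwidth: float) -> bool:
    """An op is memory-bound when its intensity falls below the device's
    ridge point (peak FLOP/s divided by peak bytes/s)."""
    return intensity < peak_flops / peak_bandwidth


# Hypothetical device: 100 GFLOP/s compute, 10 GB/s memory bandwidth.
PEAK_FLOPS = 100e9
PEAK_BANDWIDTH = 10e9

# A 196-token, 768-wide linear projection versus an activation applied
# to a 196 x 3072 tensor (shapes typical of a ViT block, used as examples).
proj = arithmetic_intensity(*matmul_cost(196, 768, 768))
act = arithmetic_intensity(*elementwise_cost(196 * 3072))

for name, ai in [("projection", proj), ("activation", act)]:
    bound = "memory" if is_memory_bound(ai, PEAK_FLOPS, PEAK_BANDWIDTH) else "compute"
    print(f"{name}: {ai:.3f} FLOPs/byte, {bound}-bound")
```

Under these assumptions the element-wise operation sits far below the ridge point while the projection sits above it, which is one way an architecture with many low-intensity operations can show higher latency than its FLOP count alone would suggest.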

Abstract

Given the significant advances in machine learning techniques on mobile devices, particularly in the domain of computer vision, in this work we quantitatively study the performance characteristics of 190 real-world vision transformers (ViTs) on mobile devices. Through a comparison with 102 real-world convolutional neural networks (CNNs), we provide insights into the factors that influence the latency of ViT architectures on mobile devices. Based on these insights, we develop a dataset including measured latencies of 1000 synthetic ViTs with representative building blocks and state-of-the-art architectures from two machine learning frameworks and six mobile platforms. Using this dataset, we show that inference latency of new ViTs can be predicted with sufficient accuracy for real-world applications.

Paper Structure

This paper contains 25 sections, 16 figures, 4 tables.

Figures (16)

  • Figure 1: Search Space Design for Synthetic ViTs
  • Figure 2: Overview of Evaluated Architectures
  • Figure 3: End-to-End Latency Comparison
  • Figure 4: Latency Breakdown Comparison
  • Figure 5: Histograms of Arithmetic Intensity
  • ...and 11 more figures