Energy-Efficient Vision Transformer Inference for Edge-AI Deployment
Nursultan Amanzhol, Jurn-Gyu Park
TL;DR
This work introduces E3P-ViT, a two-stage pipeline that first screens Vision Transformer (ViT) models with the device-agnostic NetScore metric and then measures energy, time, and accuracy on the target hardware, ranking models with the Sustainable Accuracy Metric (SAM). Benchmarking 13 ViT variants on an NVIDIA Jetson TX2 and an NVIDIA RTX 3050 across ImageNet-1K and CIFAR-10 reveals a substantial gap between theoretical efficiency and measured energy use: hybrid LeViT architectures perform best on the TX2, while distilled TinyViT models lead on the RTX 3050. The results show that hardware-aware evaluation is essential for sustainable edge deployment, because a top theoretical score does not guarantee the best on-device performance; the best hybrid model cuts energy by up to 53% on the TX2 relative to a ViT baseline. The paper suggests extending E3P-ViT to more hardware, datasets, and ViT variants to refine deployment recommendations for edge-AI systems.
Abstract
The growing deployment of Vision Transformers (ViTs) on energy-constrained devices requires evaluation methods that go beyond accuracy alone. We present a two-stage pipeline for assessing ViT energy efficiency that combines device-agnostic model selection with device-related measurements. We benchmark 13 ViT models on ImageNet-1K and CIFAR-10, running inference on an NVIDIA Jetson TX2 (edge device) and an NVIDIA RTX 3050 (mobile GPU). The device-agnostic stage uses the NetScore metric for screening; the device-related stage ranks models with the Sustainable Accuracy Metric (SAM). Results show that hybrid models such as LeViT_Conv_192 reduce energy by up to 53% on TX2 relative to a ViT baseline and reach, for example, SAM5=1.44 on TX2/CIFAR-10, while distilled models such as TinyViT-11M_Distilled excel on the mobile GPU (e.g., SAM5=1.72 on RTX 3050/CIFAR-10 and SAM5=0.76 on RTX 3050/ImageNet-1K).
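
The two-stage flow described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation. Stage one uses the NetScore formula as it is usually stated, 20 * log10(a^alpha / (p^beta * m^gamma)) with alpha=2, beta=0.5, gamma=0.5; the units chosen for parameters and MACs are an assumption. The abstract does not define SAM, so stage two takes a caller-supplied scoring function instead of guessing at its form. The names Candidate, screen, rank_on_device, and energy_reduction are hypothetical helpers introduced only for this sketch.

```python
import math
from typing import Callable, List, NamedTuple


class Candidate(NamedTuple):
    """Static description of a model (hypothetical container for this sketch)."""
    name: str
    top1_acc: float  # top-1 accuracy in percent
    params_m: float  # parameter count in millions (assumed unit)
    macs_g: float    # multiply-accumulates per inference in billions (assumed unit)


def netscore(c: Candidate, alpha: float = 2.0, beta: float = 0.5,
             gamma: float = 0.5) -> float:
    """Device-agnostic NetScore: rewards accuracy, penalizes size and compute."""
    ratio = c.top1_acc ** alpha / (c.params_m ** beta * c.macs_g ** gamma)
    return 20.0 * math.log10(ratio)


def screen(candidates: List[Candidate], keep: int) -> List[Candidate]:
    """Stage 1: keep the `keep` candidates with the highest NetScore."""
    return sorted(candidates, key=netscore, reverse=True)[:keep]


def rank_on_device(shortlist: List[Candidate],
                   device_score: Callable[[Candidate], float]) -> List[Candidate]:
    """Stage 2: order the shortlist by a measured, device-level score.

    `device_score` stands in for SAM computed from on-device energy, time,
    and accuracy measurements; its definition is not given in the abstract.
    """
    return sorted(shortlist, key=device_score, reverse=True)


def energy_reduction(model_energy_j: float, baseline_energy_j: float) -> float:
    """Fractional energy saving of a model relative to a baseline."""
    return 1.0 - model_energy_j / baseline_energy_j
```

In use, `screen` would run once per dataset on published model statistics, and `rank_on_device` would run once per device with scores derived from that device's measurements. Keeping the two stages separate mirrors the abstract's point that a high NetScore does not settle the on-device ranking. As a reading aid for the headline number, `energy_reduction` returns 0.53 when a model uses 47% of the baseline's energy.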
