Energy-Efficient Vision Transformer Inference for Edge-AI Deployment

Nursultan Amanzhol, Jurn-Gyu Park

TL;DR

This work introduces E3P-ViT, a two-stage pipeline that first screens Vision Transformer (ViT) models with the device-agnostic NetScore metric and then measures energy, inference time, and accuracy on the target hardware, ranking models with the Sustainable Accuracy Metric (SAM). Benchmarking 13 ViT variants on an NVIDIA Jetson TX2 and an NVIDIA RTX 3050 across ImageNet-1K and CIFAR-10 reveals a significant gap between theoretical efficiency and measured energy use: hybrid LeViT architectures perform best on the TX2, while distilled TinyViT models lead on the RTX 3050. Hybrid models reduce energy by up to 53% on the TX2 relative to a ViT baseline, and the top device-agnostic score does not guarantee the best on-device performance, so hardware-aware evaluation is essential for sustainable edge deployment. The paper suggests extending E3P-ViT to more hardware, datasets, and ViT variants to improve deployment recommendations for edge-AI systems.

Abstract

The growing deployment of Vision Transformers (ViTs) on energy-constrained devices requires evaluation methods that go beyond accuracy alone. We present a two-stage pipeline for assessing ViT energy efficiency that combines device-agnostic model selection with device-related measurements. We benchmark 13 ViT models on ImageNet-1K and CIFAR-10, running inference on an NVIDIA Jetson TX2 (edge device) and an NVIDIA RTX 3050 (mobile GPU). The device-agnostic stage uses the NetScore metric for screening; the device-related stage ranks models with the Sustainable Accuracy Metric (SAM). Results show that hybrid models such as LeViT_Conv_192 reduce energy by up to 53% on TX2 relative to a ViT baseline (e.g., SAM5=1.44 on TX2/CIFAR-10), while distilled models such as TinyViT-11M_Distilled excel on the mobile GPU (e.g., SAM5=1.72 on RTX 3050/CIFAR-10 and SAM5=0.76 on RTX 3050/ImageNet-1K).
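
To make the device-agnostic screening stage concrete, the following is a minimal sketch of NetScore-based ranking. It assumes the commonly cited NetScore form, 20·log10(a^α / (p^β · m^γ)) with α=2, β=0.5, γ=0.5, where a is top-1 accuracy in percent, p the parameter count, and m the MAC count. The units (millions of parameters, billions of MACs), the function names, and the candidate values are illustrative assumptions and are not taken from the paper. Changing units only shifts every score by the same constant, so the ranking is unaffected.

```python
import math


def netscore(accuracy_pct, params_m, macs_g, alpha=2.0, beta=0.5, gamma=0.5):
    """NetScore: 20 * log10(a^alpha / (p^beta * m^gamma)).

    accuracy_pct: top-1 accuracy in percent
    params_m:     parameters in millions (assumed unit)
    macs_g:       multiply-accumulate operations in billions (assumed unit)
    """
    return 20.0 * math.log10(
        accuracy_pct ** alpha / (params_m ** beta * macs_g ** gamma)
    )


def screen(candidates, keep):
    """Return the names of the `keep` highest-scoring candidates."""
    ranked = sorted(candidates, key=lambda n: netscore(*candidates[n]), reverse=True)
    return ranked[:keep]


# Hypothetical candidates: name -> (accuracy %, params in M, MACs in G).
# These numbers are made up for illustration only.
candidates = {
    "model_a": (78.0, 10.0, 1.5),
    "model_b": (80.0, 25.0, 4.0),
    "model_c": (72.0, 5.0, 0.5),
}

for name in screen(candidates, keep=2):
    print(f"{name}: NetScore = {netscore(*candidates[name]):.2f}")
```

In this sketch the smallest model ranks first despite its lower accuracy, which is the kind of purely theoretical preference the second, device-related stage is meant to check against measured energy and time.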

Paper Structure

This paper contains 22 sections, 2 equations, 3 figures, and 3 tables.

Figures (3)

  • Figure 1: Motivating Example: Comparison Between Device-Agnostic and Device-Related Metrics. While EfficientViT-B1 has fewer theoretical MACs and parameters (a), LeViT_Conv_192 achieves faster inference on hardware (b).
  • Figure 2: Methodology Overview: The Energy Efficiency Evaluation Pipeline for Vision Transformers (E3P-ViT).
  • Figure 3: The Pareto-based filtering applied to the initial set of 25 models.
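
The Pareto-based filtering named in Figure 3 can be illustrated with a generic sketch. The objectives used here (maximize accuracy, minimize a single cost such as MACs) are an assumption, since this summary does not state which criteria the paper filters on. The function name and the model values are hypothetical.

```python
def pareto_front(models):
    """Keep every model that no other model dominates.

    models: dict mapping name -> (accuracy, cost), where higher accuracy
    and lower cost are better. Model X dominates model Y if X is at least
    as good on both objectives and strictly better on at least one.
    """
    front = []
    for name, (acc, cost) in models.items():
        dominated = any(
            other_acc >= acc
            and other_cost <= cost
            and (other_acc > acc or other_cost < cost)
            for other, (other_acc, other_cost) in models.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return front


# Hypothetical models: name -> (accuracy %, cost in G-MACs), illustrative only.
models = {
    "m1": (80.0, 4.0),
    "m2": (78.0, 1.5),
    "m3": (75.0, 2.0),  # dominated by m2: lower accuracy and higher cost
    "m4": (70.0, 0.5),
}

print(pareto_front(models))
```

Here `m3` is dropped because `m2` is both more accurate and cheaper, while the other three models each represent a distinct accuracy/cost trade-off and are kept.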