Efficient Partitioning Vision Transformer on Edge Devices for Distributed Inference

Xiang Liu, Yijun Song, Xia Li, Yifei Sun, Huiying Lan, Zemin Liu, Linshan Jiang, Jialin Li

TL;DR

This work tackles the challenge of running Vision Transformers on resource-constrained edge devices by introducing ED-ViT, a framework that partitions a ViT into class-specific sub-models, prunes them to reduce size, assigns them across multiple edge devices, and fuses the results with a compact MLP. The authors present a four-step pipeline (model splitting, pruning, assignment, and fusion), using greedy strategies for the earlier steps and MLP-based aggregation for the last, to achieve distributed inference while preserving accuracy. Extensive experiments across five datasets and three ViT architectures demonstrate substantial reductions in latency and model size (up to tens of times smaller) with minimal accuracy loss, and show favorable comparisons against CNN- and SNN-based split methods. The work offers a practical pathway to scalable edge deployment of large ViTs and lays groundwork for future integration with other horizontal model-compression techniques.

Abstract

Deep learning models are increasingly deployed on resource-constrained edge devices for real-time data analytics. Recently, Vision Transformers and their variants have shown exceptional performance in various computer vision tasks. However, their substantial computational requirements and high inference latency create significant challenges for deploying such models on resource-constrained edge devices. To address this issue, we propose a novel framework, ED-ViT, which is designed to efficiently split and execute complex Vision Transformers across multiple edge devices. Our approach partitions a Vision Transformer model into several sub-models, each dedicated to handling a specific subset of data classes. To further reduce computational overhead and inference latency, we introduce a class-wise pruning technique that decreases the size of each sub-model. Through extensive experiments conducted on five datasets using three model architectures, together with an actual implementation on edge devices, we demonstrate that our method reduces inference latency on edge devices and model size by up to 28.9 times and 34.1 times, respectively, while maintaining test accuracy comparable to the original Vision Transformer. Additionally, we compare ED-ViT with two state-of-the-art methods that deploy CNN and SNN models on edge devices, evaluating metrics such as accuracy, inference time, and overall model size. Our comprehensive evaluation underscores the effectiveness of the proposed ED-ViT framework.
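
The splitting and assignment ideas described in the abstract can be illustrated with a minimal sketch. It is not the paper's algorithm: the round-robin class partition, the largest-first placement rule, and the names `split_classes` and `greedy_assign` are all assumptions made for illustration, and the pruning and MLP fusion steps are omitted.

```python
from typing import Dict, List


def split_classes(num_classes: int, num_submodels: int) -> List[List[int]]:
    """Partition class ids 0..num_classes-1 into disjoint subsets.

    Assumption: a simple round-robin split; the paper may group classes differently.
    """
    if not 0 < num_submodels <= num_classes:
        raise ValueError("need 1 <= num_submodels <= num_classes")
    subsets: List[List[int]] = [[] for _ in range(num_submodels)]
    for class_id in range(num_classes):
        subsets[class_id % num_submodels].append(class_id)
    return subsets


def greedy_assign(model_sizes_mb: List[float],
                  device_capacity_mb: List[float]) -> Dict[int, int]:
    """Map each sub-model index to a device index.

    Hypothetical greedy rule: place the largest sub-model first, always on
    the device with the most remaining memory.
    """
    remaining = list(device_capacity_mb)
    assignment: Dict[int, int] = {}
    # Visit sub-models from largest to smallest.
    order = sorted(range(len(model_sizes_mb)), key=lambda i: -model_sizes_mb[i])
    for model_idx in order:
        device_idx = max(range(len(remaining)), key=lambda j: remaining[j])
        if remaining[device_idx] < model_sizes_mb[model_idx]:
            raise ValueError(f"sub-model {model_idx} does not fit on any device")
        remaining[device_idx] -= model_sizes_mb[model_idx]
        assignment[model_idx] = device_idx
    return assignment


if __name__ == "__main__":
    # Ten classes split across four sub-models, then placed on three devices.
    print(split_classes(10, 4))
    print(greedy_assign([30.0, 10.0, 20.0, 25.0], [40.0, 40.0, 40.0]))
```

In this sketch each sub-model is responsible for one class subset, and a device may host more than one sub-model as long as its memory budget allows it.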

Paper Structure

This paper contains 21 sections, 2 equations, 7 figures, 4 tables, and 3 algorithms.

Figures (7)

  • Figure 1: The overview of ED-ViT, including four steps: Model Splitting, Model Pruning, Model Assignment and Model Fusion.
  • Figure 2: Structured pruning of a Vision Transformer block. Left: illustration of prunable components in a ViT block. Right: corresponding sequential pruning process. Our approach targets three key components: (1) channels in residual connections (red, denoted as $d$), (2) the number of heads in the MHSA module (green, denoted as $h$), and (3) hidden layer channels in the FFN (blue, denoted as $c$). The pruning process occurs in three stages: residual connection channels, MHSA heads, and FFN hidden dimensions. Yellow regions indicate parameters being pruned in the current stage, while gray regions represent previously pruned parameters.
  • Figure 3: Our 5-device example experimental prototype utilizes a switch and Raspberry Pi 4B devices, with one dedicated to the fusion model and the other four allocated to sub-models.
  • Figure 4: Performance metrics of Split ViT-Base models on the CIFAR-10, MNIST, and Caltech datasets. Panel (a) shows the accuracy results; (b) shows the latency results, where the dotted lines represent the latency of the original ViT-Base model; and (c) shows the total memory size of all the sub-models. All experimental results were collected on a Raspberry Pi 4B.
  • Figure 5: Performance metrics of Split ViT-Base models on the GTZAN and Speech Command datasets.
  • ...and 2 more figures
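
To make the three pruning targets named in the Figure 2 caption concrete, the sketch below counts the parameters of one ViT block as a function of the residual width $d$, the number of MHSA heads $h$, and the FFN hidden width $c$. It assumes a standard block layout (fused QKV projection, output projection, two-layer FFN, two LayerNorms, all with biases) and a fixed per-head dimension of 64. The function name `vit_block_params` and the pruned widths in the example are hypothetical and are not taken from the paper.

```python
def vit_block_params(d: int, h: int, c: int, head_dim: int = 64) -> int:
    """Parameter count of one ViT block under the stated assumptions.

    d: channels in the residual stream
    h: number of attention heads kept in the MHSA module
    c: hidden channels in the FFN
    head_dim: per-head width (assumed fixed at 64, as in ViT-Base)
    """
    inner = h * head_dim                 # total attention width after head pruning
    qkv = d * 3 * inner + 3 * inner      # fused Q, K, V projection with bias
    proj = inner * d + d                 # attention output projection with bias
    ffn = (d * c + c) + (c * d + d)      # two FFN linear layers with biases
    norms = 2 * (2 * d)                  # two LayerNorms, each with weight and bias
    return qkv + proj + ffn + norms


if __name__ == "__main__":
    base = vit_block_params(d=768, h=12, c=3072)    # ViT-Base block widths
    pruned = vit_block_params(d=384, h=6, c=1536)   # hypothetical pruned widths
    print(base, pruned, round(base / pruned, 2))
```

Because $d$ appears in every term, shrinking the residual width reduces all of the block's weight matrices at once, whereas pruning $h$ or $c$ only affects the attention or FFN weights.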