NuWa: Deriving Lightweight Task-Specific Vision Transformers for Edge Devices

Ziteng Wei, Qiang He, Bing Li, Feifei Chen, Yun Yang

TL;DR

NuWa addresses the mismatch between broad-capacity base ViTs and the narrow requirements of edge-device tasks by deriving lightweight, task-specific ViTs through a two-stage, dimension-aware pruning framework. The first stage prunes depth, classifier size, and number of heads in one shot; the second stage iteratively prunes query-key size, value size, expansion size, and embedding size, using SVD-based pruning for qkv and energy-based thresholds to preserve task-relevant knowledge. Across three base ViTs and three public datasets, NuWa improves accuracy by up to 11.83% and speeds up inference by 1.29×–2.79× compared with state-of-the-art pruning baselines. The approach enables practical edge deployment by balancing accuracy and latency, and shows that effective pruning depends on how task-relevant knowledge is distributed across ViT dimensions.
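The combination of SVD-based pruning and an energy-based threshold mentioned above can be illustrated with a generic low-rank factorization. This summary does not give NuWa's exact procedure, so the following is only a minimal sketch of the general technique, not the authors' implementation: the function names `energy_rank` and `low_rank_factors`, the default threshold of 0.9, and the use of squared singular values as "energy" are all assumptions made for illustration.

```python
import numpy as np


def energy_rank(singular_values: np.ndarray, energy: float) -> int:
    """Smallest rank whose squared singular values retain `energy` of the total.

    Hypothetical helper: "energy" is assumed to mean the sum of squared
    singular values, which is one common convention.
    """
    power = singular_values ** 2
    cumulative = np.cumsum(power) / power.sum()
    # Index of the first entry that reaches the threshold, converted to a count.
    return int(np.searchsorted(cumulative, energy) + 1)


def low_rank_factors(weight: np.ndarray, energy: float = 0.9):
    """Factor `weight` (d_out x d_in) into A (d_out x r) and B (r x d_in).

    The rank r is chosen by the energy threshold; A @ B is the best rank-r
    approximation of `weight` in the Frobenius norm.
    """
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    # Clamp in case floating-point rounding pushes the rank past len(s).
    r = min(energy_rank(s, energy), len(s))
    a = u[:, :r] * s[:r]  # fold the singular values into the left factor
    b = vt[:r, :]
    return a, b


if __name__ == "__main__":
    # Toy matrix with known singular values 3, 2, 1, 0.1 (an assumed example).
    weight = np.diag([3.0, 2.0, 1.0, 0.1])
    a, b = low_rank_factors(weight, energy=0.9)
    print("kept rank:", a.shape[1])
    print("reconstruction error:", np.linalg.norm(weight - a @ b))
```

Replacing a d_out × d_in matrix with the two factors stores r·(d_out + d_in) values instead of d_out·d_in, so the factorization only saves parameters when the selected rank r is below d_out·d_in / (d_out + d_in).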

Abstract

Vision Transformers (ViTs) excel in computer vision tasks but lack flexibility for edge devices' diverse needs. A vital issue is that ViTs pre-trained to cover a broad range of tasks are over-qualified for edge devices, which usually demand only part of a ViT's knowledge for specific tasks. As a result, their task-specific accuracy on these edge devices is suboptimal. We discovered that small ViTs that focus on device-specific tasks can improve model accuracy and, at the same time, accelerate model inference. This paper presents NuWa, an approach that derives small ViTs from the base ViT for edge devices with specific task requirements. NuWa transfers task-specific knowledge extracted from the base ViT into small ViTs that fully leverage the constrained resources on edge devices to maximize model accuracy while meeting inference latency constraints. Experiments with three base ViTs on three public datasets demonstrate that, compared with state-of-the-art solutions, NuWa improves model accuracy by up to 11.83% and accelerates model inference by 1.29×–2.79×. Code for reproduction is available at https://anonymous.4open.science/r/Task_Specific-3A5E.

Paper Structure

This paper contains 18 sections, 13 equations, 12 figures, 6 tables.

Figures (12)

  • Figure 1: Lightweight task-specific ViTs for diverse edge devices aiming to recognize specific classes.
  • Figure 2: Comparison between DeiT-Tiny, DeiT-Small, and DeiT-Tiny on 25, 50, and 100 random classes from ImageNet-1K, where 'DeiT-Tiny (FT)' represents DeiT-Tiny fine-tuned on task data.
  • Figure 3: Overview of NuWa: 1) One-shot pruning stage, where NuWa prunes the depth, the classifier size, and the number of heads; and 2) adaptive pruning stage, where NuWa prunes the query-key size, the value size, the expansion size, and the embedding size iteratively.
  • Figure 4: Accuracy of DeiT-Tiny with different classifiers on hard and simple sub-tasks. 'DeiT-Tiny (S)' represents DeiT-Tiny with a sub-classifier. H1, H2, H3 and S1, S2, S3 denote the hard and simple sub-tasks extracted from ImageNet-1K with $C_{edge} = 25$.
  • Figure 5: Task relevance analysis with DeiT-Base: (a) number of heads; (b) expansion size; and (c) embedding size.
  • ...and 7 more figures