PIVOT- Input-aware Path Selection for Energy-efficient ViT Inference

Abhishek Moitra, Abhiroop Bhattacharjee, Priyadarshini Panda

TL;DR

PIVOT tackles the high latency and energy cost of ViT attention by making attention activation input-aware. It uses entropy to route inputs between a low-effort and a high-effort ViT, and employs a two-phase hardware-in-the-loop search with a CKA-based Path-Score to select optimal attention configurations, all validated via a cycle-accurate PIVOT-Sim on FPGA and across CPUs/GPUs. The approach delivers substantial reductions in energy-delay-product (EDP) with minimal accuracy loss (e.g., ~2.7x EDP reduction at ~0.2% accuracy loss on LVViT-S) and outperforms prior token pruning and sparsification methods, while remaining general-purpose and open-source. Overall, PIVOT enables efficient, input-aware ViT inference without requiring specialized hardware, facilitating practical deployment across diverse computing platforms.
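The entropy-based routing described above can be illustrated with a minimal sketch. It assumes the entropy is taken over the softmax output of the low-effort ViT and compared against a fixed threshold; the function names, the example logits, and the threshold value of 0.5 are all hypothetical and not taken from the paper.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of class logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def route(low_effort_logits, threshold):
    """Return 'low' if the low-effort prediction is confident enough
    (entropy at or below the threshold), otherwise 'high'.
    The threshold is an assumed, user-chosen value."""
    h = entropy(softmax(low_effort_logits))
    return "low" if h <= threshold else "high"

# Hypothetical logits: one peaked (easy input), one nearly flat (hard input).
confident = [8.0, 0.5, 0.1, 0.2]
ambiguous = [1.0, 1.1, 0.9, 1.0]
print(route(confident, threshold=0.5))
print(route(ambiguous, threshold=0.5))
```

A peaked distribution has entropy near zero and stays on the low-effort path, while a nearly uniform one approaches the maximum entropy (ln of the class count) and is escalated to the high-effort path.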

Abstract

The attention module in vision transformers (ViTs) computes intricate spatial correlations and contributes significantly to both accuracy and delay. It is therefore important to modulate the number of attention modules executed according to the input feature complexity for optimal delay-accuracy tradeoffs. To this end, we propose PIVOT, a co-optimization framework that selectively skips attention based on input difficulty. PIVOT employs a hardware-in-the-loop co-search to obtain optimal attention skip configurations. Evaluations on the ZCU102 MPSoC FPGA show that PIVOT achieves 2.7x lower EDP at a 0.2% accuracy reduction compared to the LVViT-S ViT. PIVOT also achieves 1.3% higher accuracy and 1.8x higher throughput than prior works on traditional CPUs and GPUs. The PIVOT project can be found at https://github.com/Intelligent-Computing-Lab-Yale/PIVOT.
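As a rough illustration of what an attention skip configuration could look like, the sketch below walks a stack of encoder blocks and bypasses the attention sub-layer wherever a boolean mask says so. It assumes a standard residual encoder in which the MLP sub-layer always runs; the names `run_encoder_stack` and `effort`, the mask representation, and the toy sub-layers are hypothetical and are not the authors' implementation.

```python
def run_encoder_stack(x, blocks, attention_mask):
    """Apply encoder blocks in order, skipping the attention sub-layer
    of block i when attention_mask[i] is False.

    blocks: list of (attn_fn, mlp_fn) pairs (stand-ins for real sub-layers).
    attention_mask: list of booleans, one per block (assumed representation).
    """
    if len(blocks) != len(attention_mask):
        raise ValueError("need exactly one mask entry per encoder block")
    for (attn_fn, mlp_fn), use_attention in zip(blocks, attention_mask):
        if use_attention:
            x = x + attn_fn(x)  # residual attention sub-layer
        x = x + mlp_fn(x)       # MLP sub-layer runs in every block
    return x

def effort(attention_mask):
    """Number of attention modules that remain active in a configuration."""
    return sum(attention_mask)

# Toy sub-layers that add constants, so the skipped work is easy to see.
toy_blocks = [(lambda x: 10, lambda x: 1)] * 3
mask = [True, False, True]
print(run_encoder_stack(0, toy_blocks, mask), effort(mask))
```

With the toy sub-layers, each active attention contributes 10 and each MLP contributes 1, so the result makes it visible which attentions were bypassed.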

Paper Structure

This paper contains 14 sections, 3 equations, 9 figures, 4 tables, 1 algorithm.

Figures (9)

  • Figure 1: (a) Encoder architecture of a vision transformer (Q: Query, K: Key, V: Value). (b) Delay distribution across ViT modules for the DeiT-S (left) and LVViT-S (right) ViTs. Note that attention delay is QKV+SM+QK$^T$+(SMxV)+Proj. (c) Throughput of PIVOT compared with the DeiT-S baseline (a standard DeiT-S ViT [touvron2021training]), prior token pruning (HeatViT [dong2023heatvit]) and attention sparsification (ViTCOD [you2023vitcod]) techniques, implemented on GPUs (Nvidia V100, RTX 2080 Ti, Jetson Orin Nano) and CPUs (Intel Xeon, Raspberry Pi 4). (d) PIVOT's input difficulty-aware inference.
  • Figure 2: (a) Input difficulty-aware inference procedure with PIVOT. (b) PIVOT's Phase 1 and (c) Phase 2 methodology. $LEC$ denotes the user-provided low-effort constraint, i.e., the fraction of inputs that must be classified by the low-effort ViT. For PIVOT-Sim, ViT parameters include the embedding dimension, MLP ratio, etc., and systolic array parameters include array size, dataflow, etc.
  • Figure 3: (a) CKA matrix computed between the MLP output of $Encoder_i$ ($MLP_i$) and the attention output of $Encoder_{i+1}$ ($A_{i+1}$) for the DeiT-S ViT. (b) A higher $CKA(MLP_i, A_{i+1})$ suggests data redundancy, in which case the attention can be skipped.
  • Figure 4: (a) Path accuracy vs. Path-Score ($\mathcal{S}$) at Effort = 6 for the DeiT-S ViT. (b) Design space size if random search is performed in Phase 2 without first selecting the optimal path for each effort in Phase 1 (normalized to PIVOT's design space size). (c) GPU hours for training DeiT-S, LVViT-S and the PIVOT efforts (normalized to the GPU hours required to train DeiT-S from scratch).
  • Figure 5: The PIVOT-Sim platform.
  • ...and 4 more figures
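
The CKA comparison in the Figure 3 caption can be sketched as follows. This assumes the linear variant of centered kernel alignment, $\|Y^T X\|_F^2 / (\|X^T X\|_F \|Y^T Y\|_F)$ on column-centered features; the listing above does not say which CKA variant PIVOT uses, and the helper names and toy activations are hypothetical. Rows are samples and columns are features.

```python
import math

def center_columns(m):
    """Subtract the per-column mean from a row-major matrix (list of rows)."""
    n = len(m)
    means = [sum(col) / n for col in zip(*m)]
    return [[v - mu for v, mu in zip(row, means)] for row in m]

def cross_gram_sq_norm(a, b):
    """Squared Frobenius norm of a^T b for matrices with equal row counts."""
    total = 0.0
    for col_a in zip(*a):
        for col_b in zip(*b):
            dot = sum(u * v for u, v in zip(col_a, col_b))
            total += dot * dot
    return total

def linear_cka(x, y):
    """Linear CKA between two activation matrices (samples x features)."""
    x, y = center_columns(x), center_columns(y)
    numerator = cross_gram_sq_norm(x, y)
    denominator = math.sqrt(cross_gram_sq_norm(x, x) * cross_gram_sq_norm(y, y))
    return numerator / denominator

# Toy activations: y is a scaled copy of x; z is uncorrelated with w.
x = [[1.0, 2.0], [3.0, 1.0], [0.0, 5.0], [4.0, 4.0]]
y = [[2.0 * v for v in row] for row in x]
w = [[1.0], [-1.0], [1.0], [-1.0]]
z = [[1.0], [1.0], [-1.0], [-1.0]]
print(linear_cka(x, y), linear_cka(w, z))
```

A value near 1 for $(MLP_i, A_{i+1})$ would indicate that the attention output adds little beyond the preceding MLP output, which is the redundancy signal the caption describes; a value near 0 indicates dissimilar representations.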