LPViT: Low-Power Semi-structured Pruning for Vision Transformers

Kaixin Xu; Zhe Wang; Chunyun Chen; Xue Geng; Jie Lin; Mohamed M. Sabry Aly; Xulei Yang; Min Wu; Xiaoli Li; Weisi Lin

TL;DR

Vision transformers offer strong accuracy but are resource-intensive, motivating pruning for practical deployment. LPViT introduces a block-structured, semi-structured pruning approach tailored to ViT linear layers, paired with a hardware-aware objective that optimizes both speedup and power consumption under a FLOPs constraint. A second-order Taylor-based distortion estimate and a lightweight empirical complexity method enable an efficient post-training pruning workflow without LUTs, yielding real-world hardware benefits. Across DeiT and Swin architectures, LPViT achieves substantial speedups (up to $3.93\times$ on certain hardware) and power reductions with competitive accuracy, demonstrating meaningful energy-efficient deployment potential.

Abstract

Vision transformers (ViTs) have emerged as a promising alternative to convolutional neural networks for various image analysis tasks, offering comparable or superior performance. However, one significant drawback of ViTs is their resource-intensive nature, which leads to increased memory footprint, computational complexity, and power consumption. To democratize this high-performance technology and make it more environmentally friendly, it is essential to compress ViT models, reducing their resource requirements while maintaining high performance. In this paper, we introduce a new block-structured pruning method to address the resource-intensive nature of ViTs, offering a balanced trade-off between accuracy and hardware acceleration. Unlike unstructured pruning or channel-wise structured pruning, block pruning leverages the block-wise structure of linear layers, resulting in more efficient matrix multiplications. To optimize this pruning scheme, we propose a novel hardware-aware learning objective, tailored to the block sparsity structure, that simultaneously maximizes speedup and minimizes power consumption during inference. This objective eliminates the need for empirical look-up tables and focuses solely on reducing parametrized layer connections. Moreover, we provide a lightweight algorithm for post-training pruning of ViTs, which uses a second-order Taylor approximation and empirical optimization to solve the proposed hardware-aware objective. Extensive experiments on ImageNet across various ViT architectures, including DeiT-B and DeiT-S, demonstrate competitive performance with other pruning methods and a favorable balance between accuracy preservation and power savings. In particular, we achieve up to a $3.93\times$ speedup for DeiT-B on certain hardware, and a $1.4\times$ power reduction on GPUs. Code is available at https://github.com/Akimoto-Cris/LPViT.
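
To make the block-sparsity structure concrete, the following is a minimal sketch of pruning a linear layer's weight matrix in whole blocks. It is an illustration under stated assumptions, not the authors' implementation: the function name `block_prune` and its parameters `bk`, `bn`, and `sparsity` are hypothetical, and blocks are ranked by squared L2 norm as a simple stand-in for the second-order Taylor distortion estimate the paper describes.

```python
def block_prune(weight, bk, bn, sparsity):
    """Zero out the lowest-scoring (bk x bn) blocks of a 2-D weight matrix.

    Assumption: blocks are scored by their squared L2 norm. This is a
    stand-in criterion for illustration, not the paper's second-order
    distortion estimate.
    """
    rows, cols = len(weight), len(weight[0])
    if rows % bk or cols % bn:
        raise ValueError("matrix dimensions must be divisible by the block size")

    # Score every block by the sum of its squared entries.
    scores = []
    for bi in range(rows // bk):
        for bj in range(cols // bn):
            score = sum(
                weight[r][c] ** 2
                for r in range(bi * bk, (bi + 1) * bk)
                for c in range(bj * bn, (bj + 1) * bn)
            )
            scores.append((score, bi, bj))

    # Zero the requested fraction of blocks, lowest scores first,
    # working on a copy so the input matrix is left untouched.
    n_prune = int(sparsity * len(scores))
    pruned = [row[:] for row in weight]
    for _, bi, bj in sorted(scores)[:n_prune]:
        for r in range(bi * bk, (bi + 1) * bk):
            for c in range(bj * bn, (bj + 1) * bn):
                pruned[r][c] = 0.0
    return pruned


# Toy 4x4 weight matrix with 2x2 blocks (made-up values).
W = [
    [0.1, 0.2, 3.0, 4.0],
    [0.3, 0.1, 5.0, 6.0],
    [7.0, 8.0, 0.2, 0.1],
    [9.0, 1.0, 0.1, 0.3],
]
W_pruned = block_prune(W, bk=2, bn=2, sparsity=0.5)
```

Because entire blocks become zero, a block-sparse matrix multiplication kernel can skip them outright, which is where the hardware benefit over unstructured sparsity comes from.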

Paper Structure (18 sections, 14 equations, 5 figures, 4 tables)

Introductions
Related Works
Vision Transformers (ViTs)
Pruning on CNNs
Sparsity in ViTs
Methodologies
Preliminaries
Hardware-aware pruning objective
Second-order Approximation of Output Distortion
Power consumption under Block-structured Pruning
Finding Solution to Pruning Objective
Empirical Complexity
Experiments
Datasets and Benchmarks
Main results
...and 3 more sections

Figures (5)

Figure 1: Trade-offs of different sparsity schemes in terms of model accuracy and hardware acceleration.
Figure 2: Illustration of the proposed Low-Power Semi-structured pruning method. The widths of the layers within a ViT block visualize the computational complexity (FLOPs) of each layer. We first extract all layers with prunable weights from the pretrained ViT, then obtain the empirical $\delta$-vs-sparsity curves as described in Eq. \ref{eq:langrangian2}. Next, we calculate the layer-specific target slope $\lambda_i$ according to each layer's contribution to the power consumption and select the layer-wise pruning ratios at which the target slopes are tangent to the curves. Finally, we prune the layer weights to their pruning ratios with block-structured sparsity and finetune the pruned ViT. The rightmost part of the diagram shows an example block-sparsity pattern in which the block sizes along both dimensions are equal, but the two sizes need not match, as in the experiment section.
Figure 3: Inference overhead and power reductions on hardware platforms.
Figure 4: Layerwise sparsity for DeiT-Base BK64BN64.
Figure 5: Segmentation results on the Cityscapes validation set.
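
The ratio-selection step in the Figure 2 caption, choosing each layer's sparsity where a line of slope $\lambda_i$ is tangent to its $\delta$-vs-sparsity curve, can be sketched as below. This is a hedged illustration only: the helper `pick_sparsity`, the layer names, the curve samples, and the slope values are all hypothetical, and the sketch assumes each curve is convex, increasing, and sampled at a few discrete sparsity levels.

```python
def pick_sparsity(curve, target_slope):
    """Pick the sampled sparsity at which a line of slope `target_slope`
    touches the distortion curve.

    curve: list of (sparsity, distortion) samples.
    Assumption: the curve is convex, so minimising
    distortion - target_slope * sparsity over the samples selects the
    tangent point.
    """
    return min(curve, key=lambda p: p[1] - target_slope * p[0])[0]


# Hypothetical delta-vs-sparsity samples for two layers (made-up numbers).
curves = {
    "attn.qkv": [(0.0, 0.0), (0.25, 0.1), (0.5, 0.4), (0.75, 1.2)],
    "mlp.fc1": [(0.0, 0.0), (0.25, 0.05), (0.5, 0.15), (0.75, 0.5)],
}

# Hypothetical layer-specific target slopes lambda_i.
slopes = {"attn.qkv": 0.8, "mlp.fc1": 1.2}

ratios = {name: pick_sparsity(curves[name], slopes[name]) for name in curves}
```

In this toy setup the layer whose distortion rises more slowly and whose target slope is larger ends up with the higher pruning ratio, which matches the intuition of pruning harder where it costs less accuracy per unit of saved power.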
