PRANCE: Joint Token-Optimization and Structural Channel-Pruning for Adaptive ViT Inference
Ye Li, Chen Tang, Yuan Meng, Jiajun Fan, Zenghao Chai, Xinzhu Ma, Zhi Wang, Wenwu Zhu
TL;DR
PRANCE addresses efficient deployment of Vision Transformers by jointly optimizing architectural channels and input tokens on a per-sample basis. It introduces a channel-elastic meta-network, trained with weight-sharing, that supports arbitrary MHSA/MLP dimensions. This is paired with a lightweight PPO-based selector and a novel Result-to-Go training mechanism, which models inference as a Markov decision process to reduce the action space and mitigate delayed rewards. The method supports token pruning, merging, and pruning-then-merging. It reduces FLOPs by around 50% and retains roughly 10% of tokens, with lossless or near-lossless Top-1 accuracy across ViT-Tiny, ViT-Small, and ViT-Base. These results indicate a practical, adaptive approach to ViT compression that can outperform state-of-the-art lightweight methods and offer significant runtime efficiency for real-world deployments.
Abstract
We introduce PRANCE, a Vision Transformer compression framework that jointly optimizes the activated channels and reduces tokens based on the characteristics of each input. Specifically, PRANCE leverages adaptive token optimization strategies under a given computational budget, aiming to accelerate ViT inference from a unified data and architectural perspective. However, the joint framework poses both architectural and decision-making challenges. First, while ViTs inherently support variable-token inference, they do not support dynamic computation with a variable number of channels. To overcome this limitation, we propose a meta-network that uses weight-sharing techniques to support arbitrary channels in the Multi-head Self-Attention and Multi-layer Perceptron layers, serving as a foundational model for architectural decision-making. Second, simultaneously optimizing the structure of the meta-network and the input data constitutes a combinatorial optimization problem with an extremely large decision space, reaching up to around $10^{14}$, which makes supervised learning infeasible. To this end, we design a lightweight selector that employs Proximal Policy Optimization for efficient decision-making. Furthermore, we introduce a novel "Result-to-Go" training mechanism that models the ViT inference process as a Markov decision process, significantly reducing the action space and mitigating delayed-reward issues during training. Extensive experiments demonstrate the effectiveness of PRANCE in reducing FLOPs by approximately 50%, retaining only about 10% of tokens while achieving lossless Top-1 accuracy. Additionally, our framework is shown to be compatible with various token optimization techniques such as pruning, merging, and sequential pruning-merging strategies. The code is available at https://github.com/ChildTang/PRANCE.
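
To make the weight-sharing idea concrete, the following is a minimal sketch of a channel-elastic linear layer and MLP in plain Python. It is an illustration under stated assumptions, not the PRANCE implementation: the names `ElasticLinear` and `elastic_mlp` are hypothetical, the rule that narrower widths reuse the leading slice of a full-size weight matrix is one common weight-sharing convention assumed here, and ReLU stands in for the activation used in real ViTs.

```python
import random


class ElasticLinear:
    """Linear layer whose active input/output widths are chosen per call.

    Hypothetical illustration: every width shares one full-size weight
    matrix, and a narrower layer uses the leading rows/columns of it.
    """

    def __init__(self, max_in, max_out, seed=0):
        rng = random.Random(seed)
        self.max_in, self.max_out = max_in, max_out
        # weight[o][i] connects input channel i to output channel o.
        self.weight = [
            [rng.uniform(-0.1, 0.1) for _ in range(max_in)]
            for _ in range(max_out)
        ]
        self.bias = [0.0] * max_out

    def forward(self, x, out_dim):
        in_dim = len(x)
        if not (0 < in_dim <= self.max_in and 0 < out_dim <= self.max_out):
            raise ValueError("requested width exceeds the shared weight")
        # Only the first out_dim rows and first in_dim columns are active.
        return [
            sum(w * v for w, v in zip(self.weight[o][:in_dim], x)) + self.bias[o]
            for o in range(out_dim)
        ]


def elastic_mlp(x, fc1, fc2, hidden_dim):
    """Two-layer MLP whose hidden width is selected at call time."""
    hidden = fc1.forward(x, hidden_dim)
    hidden = [max(0.0, v) for v in hidden]  # ReLU, assumed for simplicity
    return fc2.forward(hidden, len(x))


fc1 = ElasticLinear(max_in=8, max_out=32)
fc2 = ElasticLinear(max_in=32, max_out=8)
token = [0.5] * 8

# The same shared weights serve both a narrow and a wide hidden layer.
narrow = elastic_mlp(token, fc1, fc2, hidden_dim=8)
wide = elastic_mlp(token, fc1, fc2, hidden_dim=32)
```

In this sketch a selector would only need to emit `hidden_dim` (and the analogous attention width) per sample. The shared weights stay fixed, so no separate model is stored for each width.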
