
PRANCE: Joint Token-Optimization and Structural Channel-Pruning for Adaptive ViT Inference

Ye Li, Chen Tang, Yuan Meng, Jiajun Fan, Zenghao Chai, Xinzhu Ma, Zhi Wang, Wenwu Zhu

TL;DR

PRANCE targets efficient deployment of Vision Transformers by jointly optimizing architectural channels and input tokens on a per-sample basis. It introduces a channel-elastic meta-network trained with weight-sharing to support arbitrary MHSA/MLP dimensions, coupled with a lightweight PPO-based selector. The selector is trained with a novel "Result-to-Go" mechanism that models inference as a Markov decision process and reduces the action space. The method supports token pruning, merging, and pruning-then-merging. It reduces FLOPs by around 50% and retains roughly 10% of tokens while achieving lossless or near-lossless Top-1 accuracy across ViT-Tiny, ViT-Small, and ViT-Base. These results indicate a practical, adaptive approach to ViT compression that can outperform state-of-the-art lightweight methods and offer significant runtime efficiency for real-world deployments.

Abstract

We introduce PRANCE, a Vision Transformer compression framework that jointly optimizes the activated channels and reduces tokens, based on the characteristics of inputs. Specifically, PRANCE leverages adaptive token optimization strategies for a certain computational budget, aiming to accelerate ViTs' inference from a unified data and architectural perspective. However, the joint framework poses challenges to both architectural and decision-making aspects. First, while ViTs inherently support variable-token inference, they do not facilitate dynamic computations for variable channels. To overcome this limitation, we propose a meta-network using weight-sharing techniques to support arbitrary channels of the Multi-head Self-Attention and Multi-layer Perceptron layers, serving as a foundational model for architectural decision-making. Second, simultaneously optimizing the structure of the meta-network and input data constitutes a combinatorial optimization problem with an extremely large decision space, reaching up to around $10^{14}$, making supervised learning infeasible. To this end, we design a lightweight selector employing Proximal Policy Optimization for efficient decision-making. Furthermore, we introduce a novel "Result-to-Go" training mechanism that models ViTs' inference process as a Markov decision process, significantly reducing action space and mitigating delayed-reward issues during training. Extensive experiments demonstrate the effectiveness of PRANCE in reducing FLOPs by approximately 50%, retaining only about 10% of tokens while achieving lossless Top-1 accuracy. Additionally, our framework is shown to be compatible with various token optimization techniques such as pruning, merging, and sequential pruning-merging strategies. The code is available at https://github.com/ChildTang/PRANCE.
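
The weight-sharing idea mentioned in the abstract can be illustrated with a minimal sketch. It is not taken from the PRANCE repository: the class name `ElasticLinear`, its parameters, and the choice to use the leading rows and columns of the shared matrix are assumptions made for illustration. The point is only that one set of weights can serve several channel widths.

```python
import random


class ElasticLinear:
    """Hypothetical linear layer with one shared weight matrix.

    Smaller configurations reuse the leading rows/columns of the full
    matrix, so every width is a subset of the largest one (assumption:
    a prefix slice; the paper only states that smaller channels are
    subsets of larger ones).
    """

    def __init__(self, max_in, max_out, seed=0):
        rng = random.Random(seed)
        self.max_in, self.max_out = max_in, max_out
        # weight[o][i] connects input channel i to output channel o.
        self.weight = [
            [rng.uniform(-0.1, 0.1) for _ in range(max_in)]
            for _ in range(max_out)
        ]
        self.bias = [0.0] * max_out

    def forward(self, x, out_dim):
        in_dim = len(x)
        if not (0 < in_dim <= self.max_in and 0 < out_dim <= self.max_out):
            raise ValueError("requested dimensions exceed the shared weights")
        # Only the first out_dim rows and first in_dim columns are used.
        return [
            sum(w * v for w, v in zip(self.weight[o][:in_dim], x)) + self.bias[o]
            for o in range(out_dim)
        ]


layer = ElasticLinear(max_in=8, max_out=16)
x = [1.0] * 8
full = layer.forward(x, out_dim=16)   # widest configuration
small = layer.forward(x, out_dim=4)   # narrower configuration, same weights
narrow_in = layer.forward(x[:4], out_dim=4)  # fewer input channels as well
```

Because the narrow configuration reads the same leading rows as the wide one, `small` equals the first four entries of `full`; no separate parameters are stored per width.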

Paper Structure (11 sections, 20 equations, 9 figures, 8 tables)

Figures (9)

  • Figure 1: Comparison of PRANCE with SOTA methods. PRANCE achieves both higher Top-1 accuracy and lower complexity (FLOPs) on ImageNet.
  • Figure 2: Illustration of the inference process of PRANCE. PRANCE is a lightweight framework for ViTs that jointly optimizes model structure and data. First, the framework divides the ViT model into four groups according to the inference sequence, each containing multiple ViT blocks. During inference, the selector uses the features of each group step by step to decide the channel dimensions and token numbers for that group, aiming to minimize FLOPs while preserving accuracy. Moreover, PRANCE supports three main token optimization methods: pruning, merging, and pruning-then-merging.
  • Figure 3: The framework of PRANCE. Left: The training of PRANCE consists of two stages: (1) Meta-model pretraining. The meta-network is trained using the weight-sharing mechanism, in which the smaller channel configurations are subsets of the larger ones, to support variable channels. To simulate variable channel decisions, a configuration is randomly selected for the MHSA layer and the MLP layer in each training step. No token optimization is performed in this stage. (2) Sample-wise architecture-data joint optimization. After the meta-network converges, we freeze it and train the PPO selector using the "Result-to-Go" mechanism. In this stage, the PPO selector jointly decides the channel reduction of the MHSA and MLP layers and the token reduction. Right: We adopt a sample-wise masking mechanism to support batched training of the selector: the decisions are generated as 0-1 masks and applied to the corresponding inputs (e.g., tokens, channels) using the Hadamard product, which ensures dimensional consistency. During inference, the sample-wise mask can be replaced by averaging the decisions of each batch.
  • Figure 4: The workflow of "Result-to-Go". This mechanism is used only for training the selector. To receive immediate feedback for each decision, the meta-network is divided into multiple groups. Initially, the meta-network is set to the maximum channel number for all groups. The selector then optimizes the channel and token numbers for a single group at a time, allowing the meta-network to run to the end and obtain immediate feedback. Since the meta-network is fixed, its inference process can be viewed as a Markov decision process, allowing the selector to modify the structure of the meta-network groups one by one.
  • Figure 5: Visualization of token pruning in different transformer groups. PRANCE effectively identifies and retains important tokens while removing unimportant ones to reduce the complexity of ViTs. In addition, our framework prefers to retain tokens in the early stages and to optimize a large number of low-information tokens in the later stages.
  • ...and 4 more figures
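
The sample-wise masking described in the Figure 3 caption can be sketched as follows. This is an illustrative toy, not the authors' implementation: the helper names (`prefix_mask`, `hadamard`, `dot`), the toy values, and the assumption that kept channels form a prefix are all hypothetical. It shows why a 0-1 mask applied with an elementwise (Hadamard) product lets samples with different channel budgets share one batch shape while giving the same result as physically slicing the channels.

```python
def prefix_mask(keep, width):
    """0-1 mask keeping the first `keep` of `width` channels (assumed layout)."""
    return [1.0 if i < keep else 0.0 for i in range(width)]


def hadamard(a, b):
    """Elementwise product of two equal-length vectors."""
    return [u * v for u, v in zip(a, b)]


def dot(a, b):
    return sum(u * v for u, v in zip(a, b))


# Two samples in one batch, each with its own channel budget.
batch = [[0.5, -1.0, 2.0, 3.0], [1.0, 1.0, 1.0, 1.0]]
keep_per_sample = [2, 3]
weights = [0.1, 0.2, 0.3, 0.4]  # toy weights of a following layer

# Training-style path: mask, so every sample keeps the full width.
masked = [hadamard(x, prefix_mask(k, len(x))) for x, k in zip(batch, keep_per_sample)]
masked_out = [dot(m, weights) for m in masked]

# Slicing path: actually drop the channels; widths now differ per sample.
sliced_out = [dot(x[:k], weights[:k]) for x, k in zip(batch, keep_per_sample)]
```

Both paths produce the same outputs, but only the masked path keeps every row of the batch at the same width, which is what makes batched training with per-sample decisions possible. Masking alone does not reduce computation; the savings come from running the reduced configuration at inference.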