AutoPrune: Each Complexity Deserves a Pruning Policy

Hanshi Wang, Yuhao Xu, Zekun Xu, Jin Gao, Yufan Liu, Weiming Hu, Ke Wang, Zhipeng Zhang

TL;DR

This work tackles the inefficiency of long visual token sequences in vision–language models by introducing AutoPrune, a training-free, complexity-adaptive pruning framework. It quantifies input complexity via mutual information between early visual and textual tokens and converts this signal into budget-constrained logistic retention curves that dictate per-layer token pruning under a fixed compute budget $c_{\max}$. The method is evaluated across multiple VLM and VLA settings and achieves substantial token and FLOPs reductions with minimal accuracy loss (e.g., pruning 89% of visual tokens on LLaVA-1.5-7B while retaining 96.7% of the original accuracy, a 9.1% improvement over PDrop). By integrating neuroscience-inspired insights into cross-modal processing, AutoPrune provides a simple, robust, and broadly applicable pruning paradigm for real-time multimodal reasoning and embodied intelligence.

Abstract

The established redundancy in visual tokens within large vision-language models allows pruning to effectively reduce their substantial computational demands. Previous methods typically employ heuristic layer-specific pruning strategies: although the number of tokens removed may differ across decoder layers, the overall pruning schedule is fixed and applied uniformly to all input samples and tasks, so token elimination is not aligned with the model's holistic reasoning trajectory. Cognitive science indicates that human visual processing often begins with broad exploration to accumulate evidence before narrowing focus as the target becomes distinct. Our experiments reveal an analogous pattern in these models. This observation suggests that neither a fixed pruning schedule nor a heuristic layer-wise strategy can optimally accommodate the diverse complexities inherent in different inputs. To overcome this limitation, we introduce Complexity-Adaptive Pruning (AutoPrune), a training-free, plug-and-play framework that tailors pruning policies to varying sample and task complexities. Specifically, AutoPrune quantifies the mutual information between visual and textual tokens, then projects this signal onto a budget-constrained logistic retention curve. Each such logistic curve, defined by its unique shape, corresponds to the specific complexity of a task and guarantees adherence to predefined computational constraints. We evaluate AutoPrune on standard vision-language tasks and on Vision-Language-Action models for autonomous driving. Notably, when applied to LLaVA-1.5-7B, our method prunes 89% of visual tokens and reduces inference FLOPs by 76.8% while retaining 96.7% of the original accuracy averaged over all tasks. This corresponds to a 9.1% improvement over the recent work PDrop, demonstrating the method's effectiveness. Code is available at https://github.com/AutoLab-SAI-SJTU/AutoPrune.
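
The idea of a budget-constrained logistic retention curve can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `logistic_retention`, the parameter `slope`, and the reading of the budget as the mean per-layer retention ratio are all assumptions made here for illustration. The mapping from mutual information to curve shape is not given in this section, so `slope` simply stands in for it. The sketch fixes the slope and solves for the curve's midpoint by bisection so that the average retention meets the budget.

```python
import math


def _sigmoid(x):
    # Numerically stable logistic function (never calls exp on a large positive value).
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def logistic_retention(num_layers, slope, budget, iters=60):
    """Per-layer retention ratios r(l) = sigmoid(-slope * (l - midpoint)).

    Hypothetical helper: `budget` is assumed to be the target mean retention
    ratio over layers, and `slope` (> 0) controls how sharply tokens are dropped.
    The midpoint is found by bisection so that mean(r) matches `budget`.
    """
    def curve(midpoint):
        return [_sigmoid(-slope * (layer - midpoint)) for layer in range(num_layers)]

    # Search range wide enough that the mean retention spans almost (0, 1).
    lo, hi = -40.0 / slope, num_layers + 40.0 / slope
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if sum(curve(mid)) / num_layers < budget:
            lo = mid  # too few tokens kept: move the drop point deeper
        else:
            hi = mid  # too many tokens kept: move the drop point earlier
    return curve(0.5 * (lo + hi))


# Two curves with different shapes that satisfy the same budget.
gentle = logistic_retention(num_layers=32, slope=0.5, budget=0.3)
sharp = logistic_retention(num_layers=32, slope=2.0, budget=0.3)
```

Both `gentle` and `sharp` average to the same retention ratio, so under this simplified cost model the shape of the curve can vary per sample without exceeding the budget.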

Paper Structure

This paper contains 13 sections, 9 equations, 2 figures, 4 tables.

Figures (2)

  • Figure 1: Layer-wise Visual–Textual Interaction Patterns. By visualizing cross-modal attention at layers 2, 4, 8 and 16 of the VLM, we observe that for tasks requiring only object identification, attention rapidly converges on the salient region and remains stable, whereas for reasoning-intensive tasks attention shifts progressively across layers.
  • Figure 2: Logistic retention curves on the TextVQA dataset. Each curve corresponds to a QA pair and is parameterized by the mutual information between visual and textual tokens. Samples or tasks exhibiting lower mutual information show more conservative retention.
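
The mutual-information signal that parameterizes the curves in Figure 2 can be illustrated with a small sketch. The estimator the paper uses is not given in this section, so the code below is an assumption: it treats a non-negative text-by-visual attention matrix as an unnormalized joint distribution and computes the standard discrete mutual information. The function name `mutual_information` and the input layout are hypothetical.

```python
import math


def mutual_information(weights):
    """Discrete mutual information (in nats) of a non-negative 2D weight matrix.

    Assumption: rows index textual tokens, columns index visual tokens, and the
    matrix is normalized to sum to 1 so it can be read as a joint distribution.
    """
    total = sum(sum(row) for row in weights)
    joint = [[w / total for w in row] for row in weights]
    p_text = [sum(row) for row in joint]            # marginal over text tokens
    p_vis = [sum(col) for col in zip(*joint)]       # marginal over visual tokens

    mi = 0.0
    for i, row in enumerate(joint):
        for j, p in enumerate(row):
            if p > 0.0:  # zero-probability cells contribute nothing
                mi += p * math.log(p / (p_text[i] * p_vis[j]))
    return mi


# Attention spread evenly over all tokens carries no mutual information,
# while a one-to-one alignment between tokens carries the maximum for its size.
diffuse = mutual_information([[1.0, 1.0], [1.0, 1.0]])
aligned = mutual_information([[1.0, 0.0], [0.0, 1.0]])
```

In this toy setting `diffuse` is zero and `aligned` equals `log(2)`, which matches the intuition that sharply aligned visual and textual tokens yield a higher signal than diffuse attention.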