FlowCut: Rethinking Redundancy via Information Flow for Efficient Vision-Language Models

Jintao Tong, Wenwei Jin, Pengda Qin, Anqi Li, Yixiong Zou, Yuhong Li, Yuhua Li, Ruixuan Li

TL;DR

This paper tackles the high computational cost of large vision-language models (LVLMs) caused by redundant vision tokens. It introduces FlowCut, an information-flow-aware pruning framework that models token interactions across layers, using a CLS-token-based proxy and hub-token dynamics to identify redundancy. By combining adaptive prune ratios based on attention entropy, a multi-criteria token evaluator, and cumulative flow tracking, FlowCut reduces computation while preserving accuracy on image and video tasks. The reported results show a 3.2× speed-up in the prefilling stage with minimal performance loss, supporting the practical value of aligning pruning with the intrinsic information flow in LVLMs.

Abstract

Large vision-language models (LVLMs) excel at multimodal understanding but suffer from high computational costs due to redundant vision tokens. Existing pruning methods typically rely on single-layer attention scores to rank and prune redundant visual tokens to solve this inefficiency. However, as the interaction between tokens and layers is complicated, this raises a basic question: Is such a simple single-layer criterion sufficient to identify redundancy? To answer this question, we rethink the emergence of redundant visual tokens from a fundamental perspective: information flow, which models the interaction between tokens and layers by capturing how information moves between tokens across layers. We find (1) the CLS token acts as an information relay, which can simplify the complicated flow analysis; (2) the redundancy emerges progressively and dynamically via layer-wise attention concentration; and (3) relying solely on attention scores from single layers can lead to contradictory redundancy identification. Based on this, we propose FlowCut, an information-flow-aware pruning framework, mitigating the insufficiency of the current criterion for identifying redundant tokens and better aligning with the model's inherent behaviors. Extensive experiments show that FlowCut achieves superior results, outperforming SoTA by 1.6% on LLaVA-1.5-7B with 88.9% token reduction, and by 4.3% on LLaVA-NeXT-7B with 94.4% reduction, delivering 3.2x speed-up in the prefilling stage. Our code is available at https://github.com/TungChintao/FlowCut
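Finding (3) above can be made concrete with a toy example. The sketch below is not taken from the paper: the two attention vectors are made-up values standing in for CLS-to-patch attention at two hypothetical layers, and `top_k_indices` and `overlap_ratio` are illustrative helper names. It only shows how a top-k rule applied to a single layer can select different tokens depending on which layer is chosen.

```python
def top_k_indices(scores, k):
    """Return the set of indices of the k largest scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(order[:k])


def overlap_ratio(scores_a, scores_b, k):
    """Fraction of the top-k tokens that two score vectors agree on."""
    keep_a = top_k_indices(scores_a, k)
    keep_b = top_k_indices(scores_b, k)
    return len(keep_a & keep_b) / k


# Made-up CLS-to-patch attention for 6 patch tokens at two hypothetical layers.
layer_a = [0.30, 0.25, 0.20, 0.10, 0.10, 0.05]
layer_b = [0.05, 0.10, 0.30, 0.25, 0.10, 0.20]

# Layer A would keep tokens {0, 1, 2}; layer B would keep tokens {2, 3, 5}.
print(top_k_indices(layer_a, 3), top_k_indices(layer_b, 3))
print(overlap_ratio(layer_a, layer_b, k=3))
```

With these toy values the two layers agree on only one of the three retained tokens, which is the kind of disagreement that motivates looking at scores across layers rather than at one layer in isolation.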

Paper Structure

This paper contains 37 sections, 6 equations, 12 figures, 11 tables, 1 algorithm.

Figures (12)

  • Figure 1: (Left) Performance across various visual understanding benchmarks on diverse LVLMs, where FlowCut significantly outperforms other methods. (Right) We provide a unified, bottom-up perspective based on information flow by modeling inter-token interactions across layers, revealing the dynamic emergence of redundancy and guiding pruning decisions to align with this inherent behavior.
  • Figure 2: (Left) Information outflow and inflow across vision encoder layers. (Right) Attention map of various tokens across vision encoder layers. Patch tokens engage in sparse, selective information interactions while the CLS token acts as a context relay that gathers and provides information globally.
  • Figure 3: (Left) Average attention distance of patch tokens generally increases as the layer deepens. (Middle) Attention entropy across vision encoder layers. (Right) The CLS token attention across layers: dynamic evolution of attention distributions.
  • Figure 4: Different criteria for identifying critical tokens can show contradictory results, implying the instability of each single criterion in each single layer.
  • Figure 5: The overview of FlowCut, an information flow-aware pruning framework. The process involves: (1) adaptively determining pruning ratios based on attention concentration; (2) evaluating token importance via multiple criteria; and (3) pruning tokens based on combined current and historical scores, with historical values updated accordingly.
  • ...and 7 more figures
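
The three steps named in the Figure 5 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the excerpt gives no formulas, so the linear entropy-to-budget mapping, the product used to combine two criteria, the `decay` blending of current and historical scores, and all function and parameter names here are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of an attention distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def adaptive_keep_count(cls_attn, min_keep, max_keep):
    """Step 1 (assumed mapping): concentrated attention -> keep fewer tokens."""
    n = len(cls_attn)
    h_norm = entropy(cls_attn) / math.log(n) if n > 1 else 0.0
    k = round(min_keep + h_norm * (max_keep - min_keep))
    return max(min_keep, min(k, max_keep, n))


def prune_step(cls_attn, second_criterion, history, decay, min_keep, max_keep):
    """One pruning step over the tokens that are still alive.

    cls_attn         -- CLS-to-token attention for the current layer
    second_criterion -- any other per-token importance signal (assumption)
    history          -- cumulative scores carried over from earlier layers
    """
    # Step 2 (assumed): combine criteria by multiplication, then normalize.
    current = [a * s for a, s in zip(cls_attn, second_criterion)]
    total = sum(current)
    if total > 0:
        current = [c / total for c in current]
    else:
        current = [1.0 / len(current)] * len(current)

    # Step 3 (assumed): blend historical and current scores, then keep top-k.
    cumulative = [decay * h + (1.0 - decay) * c
                  for h, c in zip(history, current)]
    k = adaptive_keep_count(cls_attn, min_keep, max_keep)
    ranked = sorted(range(len(cumulative)),
                    key=lambda i: cumulative[i], reverse=True)
    keep = sorted(ranked[:k])  # preserve original token order
    # History is updated only for the tokens that survive.
    return keep, [cumulative[i] for i in keep]


# Toy usage: 4 tokens, attention concentrated on token 0.
keep, new_history = prune_step(
    cls_attn=[1.0, 0.0, 0.0, 0.0],
    second_criterion=[1.0, 1.0, 1.0, 1.0],
    history=[0.25, 0.25, 0.25, 0.25],
    decay=0.5, min_keep=1, max_keep=4,
)
print(keep, new_history)
```

In this toy setup, fully concentrated attention gives zero entropy, so the budget falls to `min_keep`, while a uniform distribution gives maximal entropy and the budget rises to `max_keep`. Whether FlowCut uses this direction and shape of mapping is not stated in the excerpt.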