PruneVid: Visual Token Pruning for Efficient Video Large Language Models

Xiaohu Huang, Hao Zhou, Kai Han

TL;DR

PruneVid reduces the computational cost of video input in multi-modal LLMs by pruning visual tokens without retraining. It merges temporally static tokens, clusters spatially similar tokens to compress the input further, and uses question-to-visual attention inside the LLM to retain only the tokens relevant to the question, keeping the KV caches compressed during decoding. Across multiple video benchmarks and three video LLMs, it prunes over 80% of tokens with minimal performance loss, delivers up to a 1.55x speedup, reduces FLOPs by 74–80%, and lowers memory usage. Because the approach is training-free and model-agnostic, it can be applied to existing video LLMs without fine-tuning.

Abstract

In this paper, we introduce PruneVid, a visual token pruning method designed to enhance the efficiency of multi-modal video understanding. Large Language Models (LLMs) have shown promising performance in video tasks due to their extended capabilities in comprehending visual modalities. However, the substantial redundancy in video data presents significant computational challenges for LLMs. To address this issue, we propose a training-free method that 1) minimizes video redundancy by merging spatial-temporal tokens, and 2) leverages LLMs' reasoning capabilities to selectively prune visual features, retaining those relevant to the question tokens, which enhances model efficiency. We validate our method across multiple video benchmarks, which demonstrate that PruneVid can prune over 80% of tokens while maintaining competitive performance when combined with different model networks. This highlights its superior effectiveness and efficiency compared to existing pruning methods. Code: https://github.com/Visual-AI/PruneVid.
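
The question-guided pruning step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `select_visual_tokens`, the `keep_ratio` parameter, and the choice to score each visual token by the maximum attention it receives from any question token are all assumptions made for illustration.

```python
def select_visual_tokens(attn, keep_ratio=0.2):
    """Pick the visual tokens that question tokens attend to most.

    attn: list of rows, one per question token; attn[i][j] is the attention
          weight from question token i to visual token j (hypothetical input).
    keep_ratio: fraction of visual tokens to keep (illustrative default).
    Returns the kept visual-token indices in their original order.
    """
    if not attn or not attn[0]:
        return []
    num_visual = len(attn[0])
    # Assumption: a visual token's score is the strongest attention it
    # receives from any question token.
    scores = [max(row[j] for row in attn) for j in range(num_visual)]
    num_keep = max(1, int(keep_ratio * num_visual))
    ranked = sorted(range(num_visual), key=lambda j: scores[j], reverse=True)
    # Restore the original token order so positional structure is preserved.
    return sorted(ranked[:num_keep])


# Two question tokens attending over four visual tokens.
attn = [
    [0.10, 0.60, 0.20, 0.10],
    [0.05, 0.05, 0.10, 0.80],
]
kept = select_visual_tokens(attn, keep_ratio=0.5)
```

Here `kept` holds the indices of the two highest-scoring visual tokens; the remaining tokens would be dropped from later layers and from the KV cache.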

Paper Structure

This paper contains 19 sections, 11 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: (a) PruneVid first identifies the static regions in the video that exhibit minimal variation, thereby compressing the redundancy of static tokens along the temporal dimension. It then further reduces spatial redundancy through compression in the spatial dimension. Subsequently, within the LLM, PruneVid utilizes question-to-visual attention scores to guide the selection of relevant visual tokens. (b) Static regions refer to areas with minimal change, while dynamic regions exhibit motion. Therefore, static regions can be compressed together along the temporal dimension. (c) Visualization of how attention evolves from shallow to deep layers (32 layers in total). The question tokens attend to semantically related visual regions (e.g., the hands and window) throughout different layers.
  • Figure 2: Illustration of the PruneVid framework. We begin by segmenting the video into different scenes and then decouple the video tokens into static and dynamic ones. Next, we compress the static tokens along the temporal dimension and merge similar tokens in the spatial dimension to further reduce redundancy. Afterward, by using the question-to-video attention weights learned from an intermediate layer, we determine which tokens should be pruned to improve efficiency.
  • Figure 3: The ablation study of hyper-parameters.
  • Figure 4: Visualization of the question-to-visual attentions and token selection of PruneVid.
  • Figure 5: Attention map comparison of video encoders and our method.
  • ...and 1 more figure
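
The temporal compression of static regions described in the captions of Figures 1 and 2 can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the paper's procedure: it skips scene segmentation, treats a spatial position as static when the mean cosine similarity between consecutive frames exceeds a hypothetical `threshold`, and merges static tokens by averaging them over time. The names `cosine` and `merge_static_tokens` are illustrative.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def merge_static_tokens(frames, threshold=0.9):
    """Split spatial positions into static and dynamic ones.

    frames[t][p] is the feature vector at spatial position p in frame t.
    Returns (static, dynamic):
      static  maps position -> one time-averaged feature vector
      dynamic maps position -> the unchanged list of per-frame features
    """
    num_frames = len(frames)
    num_positions = len(frames[0])
    static, dynamic = {}, {}
    for p in range(num_positions):
        track = [frames[t][p] for t in range(num_frames)]
        sims = [cosine(track[t], track[t + 1]) for t in range(num_frames - 1)]
        mean_sim = sum(sims) / len(sims) if sims else 1.0
        if mean_sim >= threshold:
            # Static position: collapse all frames into a single token.
            dim = len(track[0])
            static[p] = [sum(f[d] for f in track) / num_frames for d in range(dim)]
        else:
            # Dynamic position: keep one token per frame.
            dynamic[p] = track
    return static, dynamic


# Three frames, two spatial positions: position 0 never changes, position 1 does.
frames = [
    [[1.0, 0.0], [1.0, 0.0]],
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 0.0], [1.0, 0.0]],
]
static, dynamic = merge_static_tokens(frames)
tokens_after = len(static) + sum(len(track) for track in dynamic.values())
```

In this toy input the six original tokens reduce to four: one merged token for the static position and three per-frame tokens for the dynamic one.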