PruneVid: Visual Token Pruning for Efficient Video Large Language Models
Xiaohu Huang, Hao Zhou, Kai Han
TL;DR
PruneVid tackles the computational burden of video data in multi-modal LLMs by pruning visual tokens without retraining. It merges temporally static tokens to reduce redundancy, clusters spatial tokens to further compress the input, and leverages cross-attention inside the LLM to retain only the tokens relevant to the question, compressing the KV caches used during decoding accordingly. Across multiple video benchmarks and three video LLMs, it prunes over 80% of tokens with minimal performance loss and up to 1.55x speedup, while reducing FLOPs by 74–80% and lowering memory usage. This training-free, model-agnostic approach enables practical, scalable deployment of efficient video understanding in diverse applications.
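The question-guided selection step can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `select_visual_tokens`, the single-head scaled dot-product scoring, the max-over-question-tokens aggregation, and the `keep_ratio` parameter are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def select_visual_tokens(question_queries, visual_keys, keep_ratio):
    """Return sorted indices of the visual tokens to keep.

    Hypothetical helper: each visual token is scored by the highest
    attention weight it receives from any question token, and the
    top `keep_ratio` fraction of tokens is kept.
    """
    num_tokens = len(visual_keys)
    scale = 1.0 / math.sqrt(len(visual_keys[0]))
    relevance = [0.0] * num_tokens
    for query in question_queries:
        attn = softmax([dot(query, key) * scale for key in visual_keys])
        # Assumed aggregation: max over question tokens.
        relevance = [max(r, a) for r, a in zip(relevance, attn)]
    num_keep = max(1, round(keep_ratio * num_tokens))
    ranked = sorted(range(num_tokens), key=lambda i: relevance[i], reverse=True)
    # Restore the original token order for the surviving tokens.
    return sorted(ranked[:num_keep])


# Toy example: one question token, five visual tokens of dimension 2.
keys = [[1.0, 0.0], [0.0, 1.0], [3.0, 0.0], [0.0, 0.0], [2.0, 0.0]]
kept = select_visual_tokens([[1.0, 0.0]], keys, keep_ratio=0.4)
```

In a real video LLM the scores would come from the model's own attention maps across heads, and the same kept indices would be used to slice the cached keys and values so that decoding attends only to the retained tokens.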
Abstract
In this paper, we introduce PruneVid, a visual token pruning method designed to enhance the efficiency of multi-modal video understanding. Large Language Models (LLMs) have shown promising performance in video tasks due to their extended capabilities in comprehending visual modalities. However, the substantial redundancy in video data presents significant computational challenges for LLMs. To address this issue, we propose a training-free method that 1) minimizes video redundancy by merging spatial-temporal tokens, and 2) leverages LLMs' reasoning capabilities to selectively retain the visual features relevant to question tokens and prune the rest, enhancing model efficiency. We validate our method across multiple video benchmarks, which demonstrate that PruneVid can prune over 80% of tokens while maintaining competitive performance when combined with different model networks. This highlights its superior effectiveness and efficiency compared to existing pruning methods. Code: https://github.com/Visual-AI/PruneVid.
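As a rough illustration of the first step, merging temporally redundant tokens, the sketch below treats a spatial position as static when its token barely changes between consecutive frames, and replaces that track with a single averaged token. The function `merge_static_tokens`, the cosine-similarity criterion, and the `threshold` value are assumptions made for this example, not details taken from the released code.

```python
import math


def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def merge_static_tokens(frames, threshold=0.9):
    """Merge tokens that stay nearly constant over time.

    frames: list of T frames, each a list of N token vectors.
    Returns (static, dynamic), where static maps a spatial position to
    one time-averaged token and dynamic lists (frame, position, token)
    for every token that was left unmerged.
    """
    num_frames = len(frames)
    static, dynamic = {}, []
    for pos in range(len(frames[0])):
        track = [frames[t][pos] for t in range(num_frames)]
        sims = [cosine(track[t], track[t + 1]) for t in range(num_frames - 1)]
        if sims and min(sims) >= threshold:
            # Static position: keep a single averaged token.
            dim = len(track[0])
            static[pos] = [sum(v[d] for v in track) / num_frames for d in range(dim)]
        else:
            # Dynamic position: keep one token per frame.
            dynamic.extend((t, pos, track[t]) for t in range(num_frames))
    return static, dynamic


# Toy example: position 0 is unchanged across frames, position 1 changes.
frames = [
    [[1.0, 0.0], [1.0, 0.0]],
    [[1.0, 0.0], [0.0, 1.0]],
]
static, dynamic = merge_static_tokens(frames)
tokens_after = len(static) + len(dynamic)
```

Here four input tokens are reduced to three: one merged static token plus two dynamic ones. The spatial clustering that the method applies afterwards is not shown.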
