Sharp Eyes and Memory for VideoLLMs: Information-Aware Visual Token Pruning for Efficient and Reliable VideoLLM Reasoning

Jialong Qin, Xin Zou, Di Lu, Yibo Yan, Xuming Hu

TL;DR

VideoLLMs incur quadratic computational cost because they process large numbers of redundant visual tokens. SharpV is a two-stage, training-free pruning framework. In the first stage, Visual SharpV performs information-aware, frame-level visual token pruning before the tokens reach the LLM. In the second stage, Memory SharpV prunes the KV cache inside the LLM based on visual information degradation. Neither stage requires access to attention scores. Visual SharpV uses a dissimilarity-based metric to derive spatio-temporal token importance and a frame-wise adaptive threshold, while Memory SharpV applies a degradation-based pruning rule. Empirical results show that SharpV achieves competitive, and occasionally improved, accuracy with substantially reduced token budgets and resource usage across diverse benchmarks, while remaining compatible with hardware acceleration techniques such as Flash Attention.
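The first stage can be illustrated with a minimal sketch. It is not the authors' implementation: the scoring rule (cosine dissimilarity to the frame mean plus cosine dissimilarity to the same position in the previous frame), the budget rule, and the names `prune_frames`, `base_keep`, and `min_keep` are all assumptions chosen only to show how a dissimilarity score and a per-frame adaptive budget might fit together without any attention scores.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return dot / (na * nb)


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def prune_frames(frames, base_keep=0.5, min_keep=1):
    """Return, for each frame, the sorted indices of the tokens to keep.

    frames: list of frames; each frame is a list of token vectors, with the
    same number of tokens per frame. All scoring choices are hypothetical.
    """
    kept = []
    prev = None
    for frame in frames:
        center = mean_vector(frame)
        # Spatial score (assumed): dissimilarity to the frame's mean token.
        spatial = [1.0 - cosine(tok, center) for tok in frame]
        # Temporal score (assumed): dissimilarity to the same position in
        # the previous frame; the first frame counts as entirely new.
        if prev is None:
            temporal = [1.0] * len(frame)
        else:
            temporal = [1.0 - cosine(tok, p) for tok, p in zip(frame, prev)]
        scores = [s + t for s, t in zip(spatial, temporal)]
        # Frame-wise adaptive budget (assumed): frames that changed little
        # since the previous frame keep fewer tokens.
        novelty = min(1.0, max(0.0, sum(temporal) / len(temporal)))
        budget = math.ceil(base_keep * novelty * len(frame))
        budget = min(len(frame), max(min_keep, budget))
        order = sorted(range(len(frame)), key=lambda i: scores[i], reverse=True)
        kept.append(sorted(order[:budget]))
        prev = frame
    return kept


if __name__ == "__main__":
    frame_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, 0.5]]
    frame_b = [list(tok) for tok in frame_a]  # a static, repeated frame
    print(prune_frames([frame_a, frame_b]))
```

In this toy setup a repeated frame collapses to the minimum budget, while a frame whose tokens all changed keeps the full `base_keep` share. That mirrors the idea of per-frame retention ratios, though the actual metric in the paper may differ.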

Abstract

Current Video Large Language Models (VideoLLMs) suffer from quadratic computational complexity and key-value (KV) cache scaling, due to their reliance on processing excessive redundant visual tokens. To address this problem, we propose SharpV, a minimalist and efficient method for adaptive pruning of visual tokens and KV cache. Unlike most uniform compression approaches, SharpV dynamically adjusts pruning ratios based on spatio-temporal information. Remarkably, this adaptive mechanism occasionally achieves performance gains over dense models, offering a novel paradigm for adaptive pruning. During the KV cache pruning stage, based on observations of visual information degradation, SharpV prunes degraded visual features in a self-calibrating manner, guided by their similarity to the original visual features. In this way, SharpV achieves hierarchical cache pruning from the perspective of the information bottleneck, offering new insight into VideoLLMs' information flow. Experiments on multiple public benchmarks demonstrate the superiority of SharpV. Moreover, to the best of our knowledge, SharpV is the first two-stage pruning framework that operates without requiring access to exposed attention scores, ensuring full compatibility with hardware acceleration techniques like Flash Attention.
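The KV cache stage described above can be sketched in the same spirit. This is an illustration under stated assumptions rather than the paper's rule: a visual token is treated as degraded at a given layer when the cosine similarity between its hidden state and its original visual feature falls below the layer's mean similarity (a stand-in for the self-calibrating threshold). The names `degraded_mask`, `prune_kv`, and `slack` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return dot / (na * nb)


def degraded_mask(original, hidden, slack=0.0, eps=1e-12):
    """Return (keep, sims) for the visual tokens of one layer.

    original: original visual features, one vector per token.
    hidden:   the same tokens' hidden states at the current layer.
    The threshold is the mean similarity of this layer minus `slack`
    (an assumed rule); `eps` absorbs floating-point rounding so that
    tokens sitting exactly at the mean are kept.
    """
    sims = [cosine(o, h) for o, h in zip(original, hidden)]
    threshold = sum(sims) / len(sims) - slack
    keep = [s >= threshold - eps for s in sims]
    return keep, sims


def prune_kv(keys, values, keep):
    """Drop the key/value entries of tokens whose mask entry is False."""
    kept_keys = [k for k, m in zip(keys, keep) if m]
    kept_values = [v for v, m in zip(values, keep) if m]
    return kept_keys, kept_values


if __name__ == "__main__":
    original = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    hidden = [[1.0, 0.0], [1.0, 0.0], [1.0, 1.0]]  # the second token has drifted
    keep, sims = degraded_mask(original, hidden)
    print(prune_kv(["k0", "k1", "k2"], ["v0", "v1", "v2"], keep))
```

Because the mask depends only on feature similarity, a rule of this kind never needs the attention matrix to be materialized, which is what would keep it usable alongside fused attention kernels.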

Paper Structure

This paper contains 26 sections, 7 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: VideoLLM Response Demonstration Using Training-Free SharpV Pruning. Blue bars denote the per-frame token retention ratios dynamically computed by Visual SharpV, reflecting information-aware pruning.
  • Figure 2: The Detailed Overview of SharpV. SharpV is a two-stage training-free plug-and-play framework for video LLM pruning. In the pre-LLM stage, Visual SharpV selects important visual tokens based on spatio-temporal scores, with an adaptive pruning ratio determined by L2 norm and a dissimilarity module. In the intra-LLM stage, Memory SharpV dynamically discards key-value cache by evaluating layer-wise visual information degradation.
  • Figure 3: Similarity across different layers
  • Figure 4: Average Efficiency
  • Figure 5: Ablation Study of Parameters: $M,w,K$