Sharp Eyes and Memory for VideoLLMs: Information-Aware Visual Token Pruning for Efficient and Reliable VideoLLM Reasoning

Jialong Qin; Xin Zou; Di Lu; Yibo Yan; Xuming Hu

TL;DR

VideoLLMs incur quadratic attention cost because they process a large number of visual tokens, many of them redundant. SharpV is a two-stage, training-free pruning framework: Visual SharpV performs information-aware, frame-level visual token pruning before inference, and Memory SharpV prunes the KV cache during decoding based on visual information degradation. Neither stage requires access to attention scores. The approach uses a dissimilarity-based metric to derive spatio-temporal token importance, a frame-wise adaptive threshold to decide how many tokens each frame keeps, and a degradation-based rule to prune the cache. Empirical results show that SharpV achieves competitive, and occasionally improved, accuracy with substantially reduced token budgets and resource usage across diverse benchmarks, while remaining compatible with hardware acceleration techniques such as Flash Attention.
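The first stage can be illustrated with a minimal sketch. The summary above does not give the exact metric or threshold, so everything here is an assumption: the function name `prune_visual_tokens`, the use of cosine dissimilarity against the same-position token in the previous frame, the per-frame mean as the adaptive threshold, and the `min_keep` fallback are all hypothetical choices made for illustration, not the paper's definitions.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def prune_visual_tokens(frames, min_keep=1):
    """Return, for each frame, the indices of the tokens that are kept.

    frames: list of frames; each frame is a list of token vectors, and all
    frames are assumed to have the same number of tokens.
    Assumed scoring rule (not taken from the paper): a token's importance is
    its cosine dissimilarity to the same-position token in the previous frame.
    Assumed threshold: the mean score of the frame, so each frame gets its
    own keep ratio.
    """
    kept = []
    for t, frame in enumerate(frames):
        if t == 0:
            # No previous frame to compare against: keep everything.
            kept.append(list(range(len(frame))))
            continue
        prev = frames[t - 1]
        scores = [1.0 - cosine(tok, prev[i]) for i, tok in enumerate(frame)]
        threshold = sum(scores) / len(scores)
        keep = [i for i, s in enumerate(scores) if s > threshold]
        if len(keep) < min_keep:
            # Nearly static frame: fall back to the highest-scoring tokens.
            ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
            keep = sorted(ranked[:min_keep])
        kept.append(keep)
    return kept
```

Under these assumptions, a frame that barely changes from its predecessor keeps only a few tokens, while a frame with new content keeps more. The scoring needs only the token features themselves, which is why a scheme of this kind does not depend on attention scores.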

Abstract

Current Video Large Language Models (VideoLLMs) suffer from quadratic computational complexity and a growing key-value (KV) cache because they process a large number of redundant visual tokens. To address this problem, we propose SharpV, a minimalist and efficient method for adaptive pruning of visual tokens and the KV cache. Unlike most uniform compression approaches, SharpV dynamically adjusts pruning ratios based on spatio-temporal information. Remarkably, this adaptive mechanism occasionally achieves performance gains over dense models, offering a novel paradigm for adaptive pruning. During the KV cache pruning stage, based on observations of visual information degradation, SharpV prunes degraded visual features in a self-calibrating manner, guided by their similarity to the original visual features. In this way, SharpV achieves hierarchical cache pruning from the perspective of the information bottleneck, offering new insight into the information flow of VideoLLMs. Experiments on multiple public benchmarks demonstrate the superiority of SharpV. Moreover, to the best of our knowledge, SharpV is the first two-stage pruning framework that operates without access to attention scores, ensuring full compatibility with hardware acceleration techniques such as Flash Attention.
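The KV cache stage described above can be sketched in the same spirit. The abstract only states that degraded visual features are pruned, guided by their similarity to the original visual features. The function names `degraded_token_indices` and `prune_kv_cache`, the cosine measure, and the fixed cutoff `tau` are hypothetical stand-ins for illustration; the paper's self-calibration rule is not reproduced here.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def degraded_token_indices(original, current, tau=0.5):
    """Indices of visual tokens whose current features have drifted from the originals.

    original: per-token visual features as they entered the language model.
    current:  per-token features of the same tokens at some later layer.
    tau:      hypothetical similarity cutoff; a token counts as degraded when
              its similarity to the original falls below it.
    """
    return [
        i
        for i, (orig, cur) in enumerate(zip(original, current))
        if cosine(orig, cur) < tau
    ]


def prune_kv_cache(keys, values, original, current, tau=0.5):
    """Drop the cache entries of degraded visual tokens for one layer.

    Returns the pruned keys, the pruned values, and the kept indices.
    """
    drop = set(degraded_token_indices(original, current, tau))
    keep = [i for i in range(len(keys)) if i not in drop]
    return [keys[i] for i in keep], [values[i] for i in keep], keep
```

Applying such a rule layer by layer would give a different cache size per layer, which is one way to read the "hierarchical cache pruning" mentioned in the abstract. As in the first stage, only hidden features are compared, so no attention scores are needed.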
