Keyframe-oriented Vision Token Pruning: Enhancing Efficiency of Large Vision Language Models on Long-Form Video Processing

Yudong Liu, Jingwei Sun, Yueqian Lin, Jingyang Zhang, Ming Yin, Qinsi Wang, Jianyi Zhang, Hai Li, Yiran Chen

TL;DR

This work tackles the inefficiency of vision-language models on long-form videos by introducing Keyframe-oriented Vision Token Pruning (KVTP), which softly preserves tokens from keyframes while sparsifying others based on query relevance and contextual cues. It integrates a fine-tuned query-frame relevance predictor (enhanced by a Local and Global Context Fusion Head) with a soft frame-level pruning mechanism, bridging token pruning and hard keyframe selection. To evaluate long-video reasoning with sparse information, the authors construct SparseKV-QA from multiple benchmarks and demonstrate that KVTP can achieve up to 80% token reduction and around 64% FLOPs reduction with negligible accuracy loss. The approach significantly improves efficiency for large vision-language models, enabling scalable deployment in real-world long-video applications.

Abstract

Vision language models (VLMs) demonstrate strong capabilities in jointly processing visual and textual data. However, they often incur substantial computational overhead due to redundant visual information, particularly in long-form video scenarios. Existing approaches predominantly focus on either vision token pruning, which may overlook spatiotemporal dependencies, or keyframe selection, which identifies informative frames but discards the others, thus disrupting contextual continuity. In this work, we propose KVTP (Keyframe-oriented Vision Token Pruning), a novel framework that overcomes the drawbacks of both token pruning and keyframe selection. By adaptively assigning pruning rates based on frame relevance to the query, KVTP effectively retains essential contextual information while significantly reducing redundant computation. To thoroughly evaluate the long-form video understanding capabilities of VLMs, we curated and reorganized subsets from VideoMME, EgoSchema, and NeXT-QA into a unified benchmark named SparseKV-QA that highlights real-world scenarios with sparse but crucial events. Our experiments with VLMs of various scales show that KVTP can reduce token usage by 80% without compromising spatiotemporal and contextual consistency, significantly cutting computation while maintaining performance. These results demonstrate our approach's effectiveness in efficient long-video processing, facilitating more scalable VLM deployment.
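
The idea of assigning per-frame pruning rates from query relevance can be illustrated with a small sketch. This is not the paper's formula: the function name `soft_keep_rates`, the softmax allocation, the `temperature` parameter, and the redistribution of surplus budget from saturated frames are all assumptions made for illustration. The sketch only shows how a fixed average keep rate (for example 0.2, matching an 80% token reduction) could be spread unevenly across frames so that relevant frames keep most of their tokens while every other frame still keeps a few.

```python
import math


def soft_keep_rates(relevance, avg_keep=0.2, temperature=1.0):
    """Map per-frame relevance scores to per-frame token keep rates.

    Hypothetical scheme (an assumption, not the paper's method): a softmax
    over frames splits a fixed token budget, so the mean keep rate equals
    `avg_keep` and no rate exceeds 1.
    """
    if not 0.0 < avg_keep <= 1.0:
        raise ValueError("avg_keep must be in (0, 1]")
    n = len(relevance)
    # Numerically stable softmax weights (left unnormalized on purpose).
    peak = max(relevance)
    weights = [math.exp((r - peak) / temperature) for r in relevance]

    rates = [0.0] * n
    free = list(range(n))   # frames whose rate is not fixed yet
    budget = avg_keep * n   # total budget, measured in "full frames"
    while free:
        total = sum(weights[i] for i in free)
        if total == 0.0:
            # All remaining weights underflowed: split the rest evenly.
            share = {i: budget / len(free) for i in free}
        else:
            share = {i: budget * weights[i] / total for i in free}
        saturated = [i for i in free if share[i] > 1.0]
        if not saturated:
            for i in free:
                rates[i] = share[i]
            break
        # A frame cannot keep more than all of its tokens: cap it at 1
        # and hand the surplus budget to the remaining frames.
        for i in saturated:
            rates[i] = 1.0
        budget -= len(saturated)
        free = [i for i in free if share[i] <= 1.0]
    return rates


if __name__ == "__main__":
    # Made-up relevance scores for four frames; the third is the keyframe.
    scores = [0.1, 0.2, 3.0, 0.1]
    rates = soft_keep_rates(scores, avg_keep=0.2)
    tokens_per_frame = 196  # illustrative value, not taken from the paper
    print([round(r * tokens_per_frame) for r in rates])
```

In contrast to hard keyframe selection, every frame receives a non-zero rate here, and in contrast to uniform token pruning, the rates follow the relevance ranking. Choosing which tokens to keep inside each frame is a separate step that this sketch does not cover.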

Paper Structure

This paper contains 25 sections, 15 equations, 8 figures, 4 tables.

Figures (8)

  • Figure 1: Performance vs. FLOPs across different methods. Results are averaged over SparseKV-QA, which comprises three subsets from VideoMME, EgoSchema, and NeXT-QA.
  • Figure 2: A typical sample from the VideoMME dataset. While the first question requires all frames for a comprehensive answer, the second and third questions focus on only a small subset of frames (highlighted in colored borders). These latter questions align with our target scenario, where identifying the relevant frames from the entire video is paramount.
  • Figure 3: A representative comparison of three types of efficiency pipelines, using LLaVA-Video-7B on the Perception Test benchmark. The top row illustrates hard frame selection, where only the most query-relevant frames are retained; the model misses the events leading up to the water being poured and gives an incorrect answer. The middle row demonstrates unconditioned token pruning across frames, where excessive information loss in key frames prevents the model from determining whether the water was poured. The bottom row showcases keyframe-oriented token pruning, where most tokens from key frames are preserved while some tokens from other frames are retained to maintain contextual and temporal coherence, enabling the VLM to produce the correct answer.
  • Figure 4: Overview of the proposed data augmentation framework, applied to the VideoMME, EgoSchema, and NeXT-QA datasets. This process enhances each video with clip-level captions, de-biased queries, and query-frame relevance scores. The dataset is then divided into evaluation and training sets based on keyframe sparsity, as determined by the relevance scores.
  • Figure 5: Overview of the proposed KVTP framework. A keyframe predictor module is integrated to guide the token pruning process. A context fusion head is incorporated above the vision encoder, aggregating contextual information from both local clips and the global video. The fused logits, which encode both context and query-frame relevance information, are then converted into pruning rates as the final output.
  • ...and 3 more figures