Less Is More, but Where? Dynamic Token Compression via LLM-Guided Keyframe Prior

Yulin Li, Haokun Gui, Ziyang Fan, Junjie Wang, Bin Kang, Bin Chen, Zhuotao Tian

TL;DR

DyToK introduces a training-free framework for dynamic per-frame token compression in Video LLMs by exploiting latent keyframe priors embedded in LLM attention. A lightweight assistant model estimates frame-level importance from cross-modal attention, and a per-frame budget allocator distributes a global token budget to preserve salient frames while compressing redundant ones, compatible with both encoder-based and LLM-based pruning methods. Empirical results on long-video benchmarks show state-of-the-art efficiency-accuracy tradeoffs, with substantial accuracy gains under aggressive compression and up to 4.3x faster inference. The approach reveals a broadly transferable principle: deeper attention layers encode task-relevant priors that can guide temporal compression without retraining, enabling scalable, plug-and-play acceleration for VLLMs.

Abstract

Recent advances in Video Large Language Models (VLLMs) have achieved remarkable video understanding capabilities, yet face critical efficiency bottlenecks due to quadratic computational growth with lengthy visual token sequences of long videos. While existing keyframe sampling methods can improve temporal modeling efficiency, they introduce additional computational cost before feature encoding, and the binary frame selection paradigm is found to be suboptimal. Therefore, in this work, we propose Dynamic Token compression via LLM-guided Keyframe prior (DyToK), a training-free paradigm that enables dynamic token compression by harnessing VLLMs' inherent attention mechanisms. Our analysis reveals that VLLM attention layers naturally encode query-conditioned keyframe priors, by which DyToK dynamically adjusts per-frame token retention ratios, prioritizing semantically rich frames while suppressing redundancies. Extensive experiments demonstrate that DyToK achieves state-of-the-art efficiency-accuracy tradeoffs. DyToK shows plug-and-play compatibility with existing compression methods, such as VisionZip and FastV, attaining 4.3x faster inference while preserving accuracy across multiple VLLMs, such as LLaVA-OneVision and Qwen2.5-VL. Code is available at https://github.com/yu-lin-li/DyToK.
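
To make the idea of per-frame token retention concrete, the following is a minimal sketch of how a global token budget might be split across frames in proportion to their importance scores. It is an illustration under stated assumptions, not the authors' implementation: the function name `allocate_token_budget`, the per-frame cap `tokens_per_frame`, and the largest-remainder rounding are all hypothetical choices made here for clarity.

```python
from typing import List


def allocate_token_budget(
    importance: List[float], total_budget: int, tokens_per_frame: int
) -> List[int]:
    """Split total_budget across frames proportionally to importance.

    Hypothetical helper: each frame keeps at most tokens_per_frame tokens,
    and any budget a capped frame cannot absorb is re-shared among the rest.
    """
    n = len(importance)
    if any(s < 0 for s in importance):
        raise ValueError("importance scores must be non-negative")
    # The budget can never exceed the number of tokens that exist.
    remaining = min(total_budget, n * tokens_per_frame)
    budgets = [0] * n
    active = list(range(n))  # frames that still have room

    while remaining > 0 and active:
        weight_sum = sum(importance[i] for i in active)
        # Ideal (fractional) share per frame; uniform if all weights are zero.
        ideal = {
            i: (importance[i] * remaining / weight_sum)
            if weight_sum > 0
            else remaining / len(active)
            for i in active
        }
        grant = {i: int(ideal[i]) for i in active}
        # Largest-remainder rounding so the grants sum to `remaining`.
        leftover = max(remaining - sum(grant.values()), 0)
        by_fraction = sorted(active, key=lambda i: ideal[i] - grant[i], reverse=True)
        for i in by_fraction[:leftover]:
            grant[i] += 1
        # Apply grants, respecting the per-frame cap.
        still_active = []
        for i in active:
            take = min(grant[i], tokens_per_frame - budgets[i])
            budgets[i] += take
            remaining -= take
            if budgets[i] < tokens_per_frame:
                still_active.append(i)
        active = still_active
    return budgets
```

For example, with importance scores `[1, 6, 3]`, a budget of 100 tokens, and 196 tokens per frame, this sketch assigns 10, 60, and 30 tokens respectively. The resulting per-frame budgets could then be handed to whichever pruning method selects the tokens within each frame.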

Paper Structure

This paper contains 68 sections, 8 equations, 17 figures, 17 tables, 1 algorithm.

Figures (17)

  • Figure 1: Unveiling the keyframe prior in VLLMs. LLaVA-OneVision’s answers to video QA tasks are shown on the left. On the right, we plot the averaged attention from the final text token to visual tokens across all layers and within each frame. The top-8 frames by attention scores are arranged in time order, and the Ground Truth (GT) keyframes are highlighted in red. We observe that even when the model answers incorrectly, its attention still pinpoints the relevant frames, revealing a strong task-dependent keyframe prior.
  • Figure 2: Efficient inference methods for VLLMs. (a) LLM attention-based methods perform token pruning during LLM inference by selecting visual tokens through cross-modal attention maps from specific layers, and hence suffer from constrained pruning accuracy due to their reliance on noisy shallow-layer attention patterns. (b) Encoder feature-based methods prune tokens post-encoder using inter-patch feature correlations, but neglect temporal dynamics essential for video understanding. (c) Our approach uniquely exploits the keyframe priors embedded within LLMs to dynamically allocate frame-specific compression ratios, enabling plug-and-play enhancement of temporal perception capabilities in existing efficient VLLMs.
  • Figure 3: Illustration of DyToK. We adaptively compress video tokens through two synergistic components: (1) Temporal Importance Estimation leverages cross-modal attention from a lightweight assistant model to identify keyframes, followed by (2) Dynamic Frame-Level Compression that proportionally allocates token budgets to preserve salient content. This training-free paradigm achieves superior efficiency-accuracy tradeoffs by dynamically adjusting compression ratios per frame while maintaining compatibility with diverse token pruning methods.
  • Figure 4: Performance gains of DyToK under various retention ratios. Performance comparison of SOTA acceleration methods with and without DyToK on LLaVA-OneVision under 32-frame input. Experiments conducted on VideoMME, LongVideoBench, and MLVU across varying retention ratios show that integrating DyToK consistently improves accuracy, demonstrating its effectiveness in enhancing long-video understanding. The scores presented in the figure represent average performance across the three benchmarks. For detailed results, please refer to Tab. \ref{tab:all_32_frames_encoder} and Tab. \ref{tab:all_32_frames_llm}.
  • Figure 5: Analysis of VLLMs with 32-frame inputs. We visualize the attention behavior of LLaVA-OneVision and LLaVA-Video on a 32-frame video input under different queries. Each row shows the model’s predicted answer, the layer-frame correlation heatmap, the frame-wise attention distribution, and the layer-wise attention weights. Both models exhibit consistent and accurate localization of task-relevant keyframes. However, we also observe a recurring bias toward the initial and final frames, where attention is disproportionately high despite their limited relevance.
  • ...and 12 more figures
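
The attention averaging described in the Figure 1 caption can be sketched roughly as follows. This is a simplified illustration, not the paper's code: it assumes the attention from the final text token to the visual tokens has already been extracted and head-averaged into one row per layer, that every frame contributes the same number of visual tokens, and the names `frame_scores` and `top_k_frames` are hypothetical.

```python
from typing import List, Sequence


def frame_scores(
    attn: Sequence[Sequence[float]], tokens_per_frame: int
) -> List[float]:
    """Average attention over all layers and over the tokens of each frame.

    attn[l][t] is assumed to be the (head-averaged) attention weight from the
    final text token to visual token t at layer l.
    """
    num_layers = len(attn)
    num_tokens = len(attn[0])
    if num_tokens % tokens_per_frame != 0:
        raise ValueError("visual tokens must divide evenly into frames")
    num_frames = num_tokens // tokens_per_frame

    scores = []
    for f in range(num_frames):
        start = f * tokens_per_frame
        end = start + tokens_per_frame
        total = sum(sum(layer[start:end]) for layer in attn)
        scores.append(total / (num_layers * tokens_per_frame))
    return scores


def top_k_frames(scores: Sequence[float], k: int) -> List[int]:
    """Return the indices of the k highest-scoring frames, in time order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Toy example: 2 layers, 3 frames, 2 visual tokens per frame.
toy_attn = [
    [1, 1, 4, 4, 2, 2],
    [3, 3, 8, 8, 0, 0],
]
toy_scores = frame_scores(toy_attn, tokens_per_frame=2)
toy_top = top_k_frames(toy_scores, k=2)
```

In the toy example the middle frame receives the highest averaged attention, so it is always among the selected frames. Figure 1 uses the top-8 frames, which corresponds to `k=8` in this sketch.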