DyCoke: Dynamic Compression of Tokens for Fast Video Large Language Models

Keda Tao, Can Qin, Haoxuan You, Yang Sui, Huan Wang

TL;DR

DyCoke tackles the high computational cost of video LLMs by introducing a training-free, two-stage dynamic token compression framework that exploits temporal and spatial redundancy in video tokens. It first merges temporally similar tokens across frames (TTM) and then dynamically prunes the KV cache during decoding (DP), ensuring critical tokens are retained at each step. Empirical results across multiple benchmarks and model sizes show significant improvements: approximately 1.5x faster inference and 1.4x memory reduction with maintained or enhanced accuracy, outperforming prior training-free methods. This approach enables faster, longer-video reasoning with minimal overhead, offering practical benefits for deploying VLLMs at scale.

Abstract

Video large language models (VLLMs) have advanced significantly in processing complex video content, yet their inference efficiency remains constrained by the high computational cost of the thousands of visual tokens generated from video inputs. We empirically observe that, unlike with single-image inputs, VLLMs typically attend to visual tokens from different frames at different decoding iterations, making a one-shot pruning strategy prone to removing important tokens by mistake. Motivated by this, we present DyCoke, a training-free token compression method that optimizes token representation and accelerates VLLMs. DyCoke incorporates a plug-and-play temporal compression module that minimizes temporal redundancy by merging redundant tokens across frames, and applies dynamic KV cache reduction to selectively prune spatially redundant tokens. It ensures high-quality inference by dynamically retaining the critical tokens at each decoding step. Extensive experimental results demonstrate that DyCoke outperforms prior SoTA counterparts, achieving a 1.5x inference speedup and a 1.4x memory reduction against the baseline VLLM while still improving performance, with no training.
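The temporal compression idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `temporal_keep_mask`, the cosine-similarity criterion, the comparison against the immediately preceding frame, and the `threshold` value are all assumptions made for illustration. For simplicity, a redundant token is dropped (its counterpart in the earlier frame serves as the representative) rather than averaged into it.

```python
import numpy as np

def temporal_keep_mask(frames, threshold=0.9):
    """Return a boolean mask of tokens to keep after temporal compression.

    frames: array of shape (T, N, D) with T frames, N tokens per frame,
            and D feature dimensions (hypothetical layout).
    A token in frame t is treated as redundant when its cosine similarity
    to the token at the same position in frame t-1 exceeds `threshold`
    (an assumed criterion, not taken from the paper).
    """
    T, N, _ = frames.shape
    norms = np.linalg.norm(frames, axis=-1, keepdims=True)
    normed = frames / np.maximum(norms, 1e-8)  # avoid division by zero
    keep = np.ones((T, N), dtype=bool)         # the first frame is always kept
    for t in range(1, T):
        # Position-wise cosine similarity between consecutive frames.
        sim = np.sum(normed[t] * normed[t - 1], axis=-1)
        keep[t] = sim <= threshold
    return keep

# Toy input: 3 frames, 2 tokens each, 2-dimensional features.
frames = np.array([
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 0.0], [1.0, 0.0]],  # token 0 unchanged, token 1 changed
    [[1.0, 0.0], [1.0, 0.0]],  # both tokens unchanged
])
mask = temporal_keep_mask(frames)
compressed = frames[mask]      # flattened sequence of retained tokens
```

In this toy input, only tokens that changed between consecutive frames survive, so static regions of the video contribute a single token instead of one per frame.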

Paper Structure

This paper contains 23 sections, 8 equations, 7 figures, 8 tables.

Figures (7)

  • Figure 1: Left: We introduce DyCoke (dynamic compression of tokens), a training-free token compression method for fast video large language models. The key innovation of DyCoke over its predecessors is to dynamically remove redundant tokens during the decoding stage, squeezing both the temporal (video frames) and spatial redundancy in visual tokens. Right: Efficiency and performance comparison of various training-free token pruning methods on MVBench [li2024mvbench] with LLaVA-OV-7B [llava-ov]. DyCoke surpasses the SoTA counterparts (PruMerge [shang2024llava], FastV [fastv]), with 1.5$\times$ inference speedup and a 1.4$\times$ reduction in memory usage relative to the baseline, while simultaneously enhancing performance.
  • Figure 2: Attention score between the predicted token at different decoding iterations (x-axis) and the input video tokens (y-axis) at the decoding stage of LLaVA-OV-7B [llava-ov] (attention score averaged over all attention layers). Note that some video tokens (e.g., frame #1) become less important as decoding proceeds, while others instead become more important (e.g., frame #16). This observation motivates us to develop DyCoke, a training-free plug-and-play token compression method that can dynamically exploit token redundancy during decoding.
  • Figure 3: Detailed overview of our DyCoke method. DyCoke compresses visual tokens in VLLMs through a two-stage pruning process: visual token temporal merging (TTM) and KV cache dynamic pruning. Token temporal merging (illustrated in the red dashed box on the left) merges similar tokens in video frames at the prefilling stage, tapping into the temporal redundancy of the video input; KV cache dynamic pruning (illustrated in the blue dashed box on the right) further removes less attended visual tokens in the KV cache dynamically at the decoding stage, exploiting the spatial redundancy in visual tokens. DyCoke is a drop-in training-free approach to accelerate VLLMs.
  • Figure 4: Example cases comparing DyCoke with FastV on MVBench using LLaVA-OV-7B. The first row shows that after token compression by FastV, the model generates the wrong answer, while our method still retains the correct answer. The second row shows a case where our token compression method corrects a mistake made when attending to the full set of tokens, suggesting that retaining fewer but key tokens can enhance the model’s capability for correct video understanding.
  • Figure 5: Performance vs. $K$ values for different numbers of input frames.
  • ...and 2 more figures
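
The decoding-stage behavior described in the captions of Figures 2 and 3 can be sketched as follows. This is a hedged illustration rather than the paper's algorithm: the function name `select_active_tokens`, the top-k selection rule, and the `keep_ratio` parameter are assumptions. The point it shows is that the set of retained visual tokens is recomputed at every decoding step from the current attention scores, so a token dropped at one step can become active again later, which a one-shot pruning strategy cannot do.

```python
import numpy as np

def select_active_tokens(attn_scores, keep_ratio=0.5):
    """Pick the visual tokens to keep active for one decoding step.

    attn_scores: 1-D array with the attention from the current predicted
                 token to each visual token (e.g. averaged over layers).
    keep_ratio:  assumed fraction of visual tokens kept per step.
    Returns the sorted indices of the retained tokens.
    """
    n_tokens = attn_scores.shape[0]
    n_keep = max(1, int(round(keep_ratio * n_tokens)))
    # Indices of the n_keep highest-scoring tokens (order not guaranteed).
    top = np.argpartition(attn_scores, -n_keep)[-n_keep:]
    return np.sort(top)

# Toy attention pattern: early tokens matter at step 1, late ones at step 2.
step1_scores = np.array([0.5, 0.3, 0.1, 0.1])
step2_scores = np.array([0.05, 0.1, 0.25, 0.6])

active_step1 = select_active_tokens(step1_scores)
active_step2 = select_active_tokens(step2_scores)
```

Here a one-shot decision made at step 1 would have permanently discarded the two tokens that carry most of the attention at step 2, whereas the per-step selection recovers them.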