Accelerating Streaming Video Large Language Models via Hierarchical Token Compression

Yiyu Wang, Xuyang Liu, Xiyan Gui, Xinying Lin, Boxue Yang, Chenfei Liao, Tailai Chen, Linfeng Zhang

TL;DR

The paper tackles real-time streaming video understanding by addressing two bottlenecks: ViT encoding and LLM pre-filling. It introduces Streaming Token Compression (STC), a two-stage, plug-and-play framework that operates under streaming constraints: STC-Cacher selectively recomputes ViT features, and STC-Pruner causally prunes visual tokens before they reach the LLM. Across four streaming VideoLLMs and five benchmarks, STC reduces ViT encoding latency by up to 24.5% and LLM pre-filling latency by up to 45.3% while retaining up to 99% of accuracy on the ReKV framework. The reported results outperform other compression methods, which makes VideoLLMs more practical to deploy in latency-sensitive applications.

Abstract

Streaming Video Large Language Models (VideoLLMs) have demonstrated impressive performance across various video understanding tasks, but they face significant challenges in real-time deployment due to the high computational cost of processing dense visual tokens from continuous video streams. In streaming video scenarios, the primary bottleneck lies in the Vision Transformer (ViT) encoding stage, where redundant processing of temporally similar frames leads to inefficiency. Additionally, inflated token sequences during LLM pre-filling further exacerbate latency and memory overhead. To address these challenges, we propose Streaming Token Compression (STC), a plug-and-play hierarchical framework that seamlessly integrates into existing streaming VideoLLMs, optimizing both ViT encoding and LLM pre-filling stages to accelerate processing. STC introduces two token-level accelerators: STC-Cacher, which reduces ViT encoding overhead by caching and reusing features from temporally similar frames, and STC-Pruner, which compresses the visual token sequence before it enters the LLM, preserving only the most salient tokens based on both spatial and temporal relevance. Extensive experiments on four baseline streaming VideoLLMs across five benchmarks demonstrate that STC outperforms other compression methods. Notably, STC retains up to 99% of accuracy on the ReKV framework while reducing ViT encoding latency and LLM pre-filling latency by 24.5% and 45.3%, respectively.
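
The pruning idea in the abstract (keep only the most salient visual tokens, judged by spatial and temporal relevance) can be illustrated with a small sketch. The scoring rule below is an assumption made for illustration, not the paper's formula: each token is scored by its cosine dissimilarity to a spatial anchor (the mean of the current frame's tokens) plus its dissimilarity to a temporal anchor built only from past frames, and the top `keep` tokens survive. The names `prune_tokens`, `history_anchor`, and `keep` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_vec(tokens):
    """Element-wise mean of a non-empty list of token vectors."""
    n = len(tokens)
    return [sum(col) / n for col in zip(*tokens)]


def prune_tokens(frame_tokens, history_anchor, keep):
    """Return the indices of the `keep` highest-scoring tokens, in original order.

    Assumed score: (1 - cos to spatial anchor) + (1 - cos to temporal anchor).
    `history_anchor` summarizes past frames only, so the rule stays causal.
    """
    spatial_anchor = mean_vec(frame_tokens)
    scores = []
    for tok in frame_tokens:
        score = 1.0 - cosine(tok, spatial_anchor)
        if history_anchor is not None:
            score += 1.0 - cosine(tok, history_anchor)
        scores.append(score)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])


# Toy frame: tokens 0 and 1 match what was seen before, token 2 is novel.
frame = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
kept = prune_tokens(frame, history_anchor=[1.0, 0.0], keep=1)
```

In this toy frame the only token that differs from both anchors is index 2, so it is the one retained. A real system would operate on batched tensors and would likely use a different anchor construction.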

Paper Structure

This paper contains 22 sections, 9 equations, 8 figures, 10 tables, 2 algorithms.

Figures (8)

  • Figure 1: Inference time breakdown across components in various vision-language understanding scenarios. ViT encoding typically accounts for a substantial fraction of the inference time in video understanding, about 2-3 times that in image understanding.
  • Figure 2: Temporal redundancy in adjacent frames in ViT encoding. Streaming videos ("online") tend to show higher similarity than offline videos, indicating higher temporal redundancy.
  • Figure 3: Overview of Streaming Token Compression (STC). Our framework accelerates streaming VideoLLMs in two stages. STC-Cacher employs selective recomputation to reduce computational redundancy in the ViT. STC-Pruner then shortens the token sequence to reduce the LLM's pre-filling latency.
  • Figure 4: Visualization of cache-aware selective computation by STC-Cacher. For reference frames, STC-Cacher computes and caches all tokens. For subsequent frames, only dynamic tokens are computed, while static tokens reuse cached features from reference frames.
  • Figure 5: The mechanism of STC-Cacher. Instead of a full forward pass, STC-Cacher identifies novel tokens by comparing their Key projections ($K_{\text{curr}}$) to a cached reference ($K_{\text{ref}}$). It then recomputes the Query and Value representations only for these dynamic tokens and scatters the new Values into the cached Value matrix, giving an efficient, low-rank update to the attention computation.
  • ...and 3 more figures
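
The selective recomputation described in the captions of Figures 4 and 5 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes cosine similarity between current and reference Key vectors as the novelty test and a fixed fraction of tokens treated as dynamic. The names `select_dynamic`, `update_values`, `ratio`, and `recompute_value` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_dynamic(k_curr, k_ref, ratio):
    """Pick the tokens whose Keys changed most relative to the reference frame.

    Assumed rule: the `ratio` fraction of tokens with the lowest
    Key-to-Key cosine similarity is treated as dynamic.
    """
    sims = [cosine(c, r) for c, r in zip(k_curr, k_ref)]
    n_dyn = max(1, math.ceil(ratio * len(sims)))
    order = sorted(range(len(sims)), key=lambda i: sims[i])
    return sorted(order[:n_dyn])


def update_values(v_cache, dynamic_idx, recompute_value):
    """Scatter freshly computed Values into a copy of the cached Value matrix.

    Static tokens keep their cached rows; only dynamic rows are recomputed.
    """
    v_new = [row[:] for row in v_cache]
    for i in dynamic_idx:
        v_new[i] = recompute_value(i)
    return v_new


# Toy example: four tokens, only token 1 differs from the reference frame.
k_ref = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]
k_curr = [[1.0, 0.0], [1.0, 0.0], [1.0, 1.0], [1.0, -1.0]]
v_cache = [[0.0], [0.0], [0.0], [0.0]]

dyn = select_dynamic(k_curr, k_ref, ratio=0.25)
v_new = update_values(v_cache, dyn, lambda i: [1.0])
```

Here only token 1 is flagged as dynamic, so a single Value row is recomputed and the other three are reused from the cache. The saving grows with the share of static tokens, which Figure 2 suggests is high for streaming video.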