VideoScan: Enabling Efficient Streaming Video Understanding via Frame-level Semantic Carriers

Ruanjun Li, Yuedong Tan, Yuanming Shi, Jiawei Shao

TL;DR

VideoScan addresses the challenge of real-time streaming video understanding by compressing each frame to a single semantic carrier token, enabling a two-phase prefilling–decoding workflow that drastically reduces computation. It leverages semantic flow and a memory of past KV states to preserve temporal coherence while discarding frame-level visual tokens after prefilling, achieving up to 6 FPS with stable ~18 GB GPU memory. A two-stage training plan (LoRA-based initial fine-tuning, followed by semantic-flow–aware training) reinforces temporal-semantic coherence without extra token-generation parameters. The approach delivers strong efficiency with competitive accuracy across offline and online benchmarks, making real-time vision-language interactions more feasible for robotics, surveillance, and interactive systems.

Abstract

This paper introduces VideoScan, an efficient vision-language model (VLM) inference framework designed for real-time video interaction that effectively comprehends and retains streamed video inputs while delivering rapid and accurate responses. A longstanding challenge in video understanding--particularly for long-term or real-time applications--stems from the substantial computational overhead caused by the extensive length of visual tokens. To address this, VideoScan employs a single semantic carrier token to represent each frame, progressively reducing computational and memory overhead during its two-phase inference process: prefilling and decoding. The embedding of the semantic carrier token is derived from an optimized aggregation of frame-level visual features, ensuring compact yet semantically rich representations. Critically, the corresponding key-value pairs are trained to retain contextual semantics from prior frames, enabling efficient memory management without sacrificing temporal coherence. During inference, the visual tokens of each frame are processed only once during the prefilling phase and subsequently discarded in the decoding stage, eliminating redundant computations. This design ensures efficient VLM inference even under stringent real-time constraints. Comprehensive experiments on diverse offline and online benchmarks demonstrate that LLaVA-Video, supported by our method, achieves up to $\sim 5\times$ and $1.29\times$ speedups compared to its original version and previous efficient streaming video understanding approaches, respectively. Crucially, these improvements are attained while maintaining competitive performance and ensuring stable GPU memory consumption (consistently $\sim 18$GB, independent of video duration).
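As a rough illustration of the one-token-per-frame idea, the sketch below pools a frame's visual token embeddings into a single vector and compares how many visual positions the decoder would have to keep with and without that compression. It assumes plain average pooling (as the Figure 3 caption describes) and uses toy sizes; the function names `semantic_carrier` and `decode_context_size` are hypothetical and not taken from the paper. It covers only the carrier embedding, not the trained key-value pairs that carry context from prior frames.

```python
from typing import List


def semantic_carrier(frame_tokens: List[List[float]]) -> List[float]:
    """Average-pool one frame's visual token embeddings into a single vector."""
    if not frame_tokens:
        raise ValueError("frame has no visual tokens")
    n = len(frame_tokens)
    dim = len(frame_tokens[0])
    return [sum(tok[d] for tok in frame_tokens) / n for d in range(dim)]


def decode_context_size(num_frames: int, tokens_per_frame: int, use_carrier: bool) -> int:
    """Count the visual positions that remain available during decoding."""
    # With carriers, each frame contributes exactly one position.
    return num_frames * (1 if use_carrier else tokens_per_frame)


# Toy frame: 4 visual tokens with 2-dimensional embeddings (hypothetical sizes).
frame = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
carrier = semantic_carrier(frame)

full = decode_context_size(num_frames=100, tokens_per_frame=4, use_carrier=False)
compact = decode_context_size(num_frames=100, tokens_per_frame=4, use_carrier=True)
print(carrier, full, compact)
```

With N visual tokens per frame, keeping one carrier removes a fraction 1 - 1/N of the visual positions, so a reduction above 99% corresponds to more than 100 tokens per frame in the uncompressed model.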

Paper Structure

This paper contains 12 sections, 1 equation, 4 figures, 5 tables.

Figures (4)

  • Figure 1: Performance comparison on the MLVU benchmark. Supported by the efficiency of VideoScan, LLaVA-Video reduces $>99$% of vision tokens in inference while maintaining comparable performance, achieving a $>45$% improvement over the performance of LLaVA-Mini with only one vision token for each frame.
  • Figure 2: An example of an attention map. For better visualization, we truncate the tokens corresponding to system instructions. The attention map reveals an 'attention sink,' where the model tends to assign higher scores to tokens at nearby positions. Meanwhile, when the averaged attention score at generated tokens is projected onto the original image, it becomes evident that the visual tokens the model focuses on differ from those humans perceive as important, and that they tend to lie near the response.
  • Figure 3: The overall workflow of the proposed VideoScan inference framework. We construct a semantic carrier token from the frame-level visual features through average pooling, and leverage it to inherit all in-context semantic information in the KV. Frame-level visual tokens are processed exclusively during the prefilling phase and subsequently discarded. All visual information required during decoding is sourced from the semantic carrier. We also introduce a memory mechanism, which stores the embeddings and KV of semantic carriers, to support long-term video interactions with retrievable past visual information and optimized GPU memory usage.
  • Figure 4: The proposed two-stage training recipe for VideoScan. At stage 1, each frame is represented by a semantic carrier token, and the visual inputs are semantic carrier tokens only. At stage 2, the semantic carrier token is positioned at the end of each frame. A semantic-aware causal mask is implemented to enhance the semantic flow in the KV, ensuring that the LLM accesses only the semantic carrier.
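
The stage-2 mask in Figure 4 can be pictured with a small sketch. The captions do not spell out the exact masking rule, so the version below is one plausible reading and should be treated as an assumption: raw visual tokens are visible only to positions inside their own frame (including that frame's carrier, placed at the frame's end), while carriers and text are visible to every later position. The names `build_layout` and `semantic_aware_causal_mask` are hypothetical.

```python
from typing import List, Tuple

# Each position is (kind, frame_id); kind is "vis", "car", or "txt".
# Text positions use frame_id -1 because they belong to no frame.
Position = Tuple[str, int]


def build_layout(num_frames: int, tokens_per_frame: int, num_text: int) -> List[Position]:
    """Lay out each frame as its visual tokens followed by one carrier, then the text."""
    layout: List[Position] = []
    for f in range(num_frames):
        layout += [("vis", f)] * tokens_per_frame
        layout.append(("car", f))
    layout += [("txt", -1)] * num_text
    return layout


def semantic_aware_causal_mask(layout: List[Position]) -> List[List[bool]]:
    """Return mask[q][k] == True when query position q may attend to key position k."""
    n = len(layout)
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        q_kind, q_frame = layout[q]
        for k in range(q + 1):  # causal: never look at future positions
            k_kind, k_frame = layout[k]
            if k_kind == "vis":
                # Assumed rule: raw visual tokens are hidden outside their own frame.
                mask[q][k] = q_kind != "txt" and q_frame == k_frame
            else:
                # Carriers and text stay visible to all later positions.
                mask[q][k] = True
    return mask


# Two frames of two visual tokens each, followed by one text token.
layout = build_layout(num_frames=2, tokens_per_frame=2, num_text=1)
mask = semantic_aware_causal_mask(layout)
for row in mask:
    print("".join("1" if allowed else "." for allowed in row))
```

Under this reading, the text row attends to the two carriers and to itself but to none of the raw visual tokens, which is the property that would let those tokens be dropped after prefilling.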