A Simple Baseline for Streaming Video Understanding

Yujiao Shen, Shulin Tian, Jingkang Yang, Ziwei Liu

Abstract

Recent streaming video understanding methods increasingly rely on complex memory mechanisms to handle long video streams. We challenge this trend with a simple finding: a sliding-window baseline that feeds only the most recent N frames to an off-the-shelf VLM already matches or surpasses published streaming models. We formalize this baseline as SimpleStream and evaluate it against 13 major offline and online video LLM baselines on OVO-Bench and StreamingBench. Despite its simplicity, SimpleStream delivers consistently strong performance. With only 4 recent frames, it reaches 67.7% average accuracy on OVO-Bench and 80.59% on StreamingBench. Controlled ablations further show that the value of longer context is backbone-dependent rather than uniformly increasing with model scale, and reveal a consistent perception-memory trade-off: adding more historical context can improve recall, but often weakens real-time perception. This suggests that stronger memory, retrieval, or compression modules should not be taken as evidence of progress unless they clearly outperform SimpleStream under the same protocol. We therefore argue that future streaming benchmarks should separate recent-scene perception from long-range memory, so that performance improvements from added complexity can be evaluated more clearly.
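The baseline described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the `vlm_answer` callable stands in for an off-the-shelf VLM interface, and the frame type is left abstract.

```python
from collections import deque


class SimpleStream:
    """Sketch of the sliding-window baseline: retain only the most
    recent N frames and pass them, with the query, to a VLM."""

    def __init__(self, vlm_answer, window_size=4):
        # vlm_answer: hypothetical callable(frames, query) -> answer,
        # standing in for an off-the-shelf VLM.
        self.vlm_answer = vlm_answer
        # Bounded buffer: appending beyond maxlen evicts the oldest frame.
        self.window = deque(maxlen=window_size)

    def observe(self, frame):
        # Called once per incoming frame of the stream.
        self.window.append(frame)

    def query(self, question):
        # Answer using only the current window of recent frames.
        return self.vlm_answer(list(self.window), question)
```

With `window_size=4`, after observing ten frames only the last four remain in the buffer, matching the 4-frame configuration reported on OVO-Bench and StreamingBench.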

Paper Structure

This paper contains 31 sections, 3 equations, 6 figures, and 4 tables.

Figures (6)

  • Figure 1: A Strong Simple Baseline for Streaming Video Understanding. (a) Overview of SimpleStream: given a streaming video and a query, only the most recent $N$ frames are fed to an off-the-shelf VLM. (b) Perception-memory comparison on OVO-Bench, where SimpleStream consistently lies on the upper-right frontier across backbone families and window sizes.
  • Figure 2: A landscape of streaming video understanding methods. Most existing approaches differ mainly in how they preserve and reuse historical information under bounded budgets, while SimpleStream keeps only a small recent frame window.
  • Figure 3: Peak GPU memory vs. observed frames. SimpleStream-4f maintains the lowest and flattest memory curve because it retains only a fixed recent frame window.
  • Figure 4: Window-size ablation. Under this controlled setting, SimpleStream reaches its highest Real-Time accuracy with 4 recent frames, while overall accuracy does not improve monotonically as the window widens.
  • Figure 5: Model-scaling ablation on OVO-Bench. Average accuracy versus recent-window size for Qwen2.5-VL (left) and Qwen3-VL (right) checkpoints. Stars mark the best window for each checkpoint. Many checkpoints peak at 4f, but several prefer longer windows, including Qwen3-VL-4B at 16f and larger checkpoints at 8f or 16f.
  • ...and 1 more figure