Thinking in Streaming Video

Zikang Liu, Longteng Guo, Handong Li, Ru Zhen, Xingjian He, Ruyi Ji, Xiaoming Ren, Yanhao Zhang, Haonan Lu, Jing Liu

Abstract

Real-time understanding of continuous video streams is essential for interactive assistants and multimodal agents operating in dynamic environments. However, most existing video reasoning approaches follow a batch paradigm that defers reasoning until the full video context is observed, resulting in high latency and growing computational cost that are incompatible with streaming scenarios. In this paper, we introduce ThinkStream, a framework for streaming video reasoning based on a Watch–Think–Speak paradigm that enables models to incrementally update their understanding as new video observations arrive. At each step, the model performs a short reasoning update and decides whether sufficient evidence has accumulated to produce a response. To support long-horizon streaming, we propose Reasoning-Compressed Streaming Memory (RCSM), which treats intermediate reasoning traces as compact semantic memory that replaces outdated visual tokens while preserving essential context. We further train the model using a Streaming Reinforcement Learning with Verifiable Rewards scheme that aligns incremental reasoning and response timing with the requirements of streaming interaction. Experiments on multiple streaming video benchmarks show that ThinkStream significantly outperforms existing online video models while maintaining low latency and memory usage. Code, models and data will be released at https://github.com/johncaged/ThinkStream.
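
The per-step control flow described in the abstract (observe a chunk, make a short reasoning update, then either respond or stay silent) can be illustrated with a minimal sketch. This is not the released ThinkStream code: `StepOutput`, `StreamState`, `watch_think_speak`, and `toy_step` are hypothetical names introduced here, and the toy policy stands in for the actual model.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Optional, Tuple


# Hypothetical types for illustration only; none of these names come from the paper.
@dataclass
class StepOutput:
    thought: str              # short reasoning update for the current chunk
    response: Optional[str]   # None means the model stays silent


@dataclass
class StreamState:
    thoughts: List[str] = field(default_factory=list)  # accumulated reasoning trace


def watch_think_speak(
    chunks: Iterable[str],
    step_fn: Callable[[StreamState, str], StepOutput],
) -> List[Tuple[int, str]]:
    """Run one incremental step per arriving chunk.

    Returns (chunk_index, response) pairs for the steps where the policy spoke.
    """
    state = StreamState()
    responses: List[Tuple[int, str]] = []
    for i, chunk in enumerate(chunks):
        out = step_fn(state, chunk)          # watch the chunk and think
        state.thoughts.append(out.thought)   # carry the reasoning forward
        if out.response is not None:         # speak only when evidence suffices
            responses.append((i, out.response))
    return responses


def toy_step(state: StreamState, chunk: str) -> StepOutput:
    """Stand-in policy (assumption): respond once the chunk mentions 'goal'."""
    thought = f"saw {chunk}"
    steps_so_far = len(state.thoughts) + 1
    if "goal" in chunk:
        return StepOutput(thought, f"answer after {steps_so_far} chunks")
    return StepOutput(thought, None)
```

Calling `watch_think_speak(["kickoff", "pass", "goal"], toy_step)` stays silent for the first two chunks and emits a single response at the third, which is the behaviour the paradigm aims for: reasoning happens at every step, but output is deferred until the evidence is sufficient.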

Paper Structure (40 sections, 7 equations, 7 figures, 7 tables, 1 algorithm)

Figures (7)

  • Figure 1: Illustration of the Streaming Watch–Think–Speak paradigm. As video chunks arrive sequentially, the model continuously updates its understanding through incremental reasoning steps (<think>). Each update integrates newly observed evidence with accumulated context. Based on this evolving interpretation, the model decides whether sufficient evidence has been gathered to produce a response (<response>) or whether it should remain silent (<silent>) and continue observing the stream.
  • Figure 2: Overview of the ThinkStream framework. (a) Streaming Watch–Think–Speak Paradigm & RLVR: The model undergoes streaming rollouts and policy updates driven by format, latency, and accuracy rewards. (b) Reasoning-Compressed Streaming Memory: Outdated dense video tokens are dynamically evicted from the KV cache, while highly compressed reasoning and response tokens are retained as long-term semantic anchors. (c) Streaming Inference: A custom backend utilizes Eager Prefill for variable tokens and replayable CUDA Graphs for both the Decode and Evict Kernels, enabling an efficient chunk-by-chunk streaming loop with in-place memory shifting.
  • Figure 3: Token decoding speed comparison across different batch sizes. The custom CUDA Graph-based streaming inference engine achieves a massive speedup compared to the standard Qwen2.5-VL-3B baseline, maintaining high throughput while preserving flexible KV cache control.
  • Figure 4: Real-time latency scaling with processed video length. ThinkStream successfully bounds the end-to-end inference latency below the 0.5s real-time threshold (required for 2 FPS inputs) as the video context grows, whereas the baseline model scales poorly and consistently violates the threshold.
  • Figure 5: Data Distribution
  • ...and 2 more figures
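
The memory behaviour in the Figure 2(b) caption and the real-time threshold in the Figure 4 caption can be made concrete with a toy sketch. It models the KV cache as plain Python containers, evicts the oldest visual entries once a fixed budget is exceeded, and keeps every reasoning or response entry. The class `ToyStreamingMemory`, the fixed `max_visual_tokens` budget, and the oldest-first policy are assumptions for illustration; the captions do not specify the actual eviction rule. The helper `per_chunk_budget_seconds` only reproduces the arithmetic behind the 0.5 s threshold for 2 FPS input.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, List


@dataclass
class ToyStreamingMemory:
    """Toy stand-in for a KV cache: visual entries are evicted, text entries persist."""

    max_visual_tokens: int                               # assumed fixed visual budget
    visual: Deque[str] = field(default_factory=deque)    # dense video tokens
    text: List[str] = field(default_factory=list)        # reasoning/response tokens

    def add_chunk(self, visual_tokens: List[str], reasoning_tokens: List[str]) -> int:
        """Append one chunk and evict the oldest visual tokens over budget.

        Returns the number of visual tokens evicted by this call.
        """
        self.visual.extend(visual_tokens)
        self.text.extend(reasoning_tokens)
        evicted = 0
        while len(self.visual) > self.max_visual_tokens:
            self.visual.popleft()   # oldest-first eviction (assumption)
            evicted += 1
        return evicted

    def size(self) -> int:
        """Total number of entries currently held."""
        return len(self.visual) + len(self.text)


def per_chunk_budget_seconds(fps: float) -> float:
    """Time available per frame before the next one arrives: 1 / fps."""
    return 1.0 / fps
```

With `max_visual_tokens=4`, adding two chunks of three visual tokens each evicts the two oldest visual tokens while both reasoning tokens survive, so the visual part of the memory stays bounded and only the much smaller text part grows with stream length. `per_chunk_budget_seconds(2.0)` gives 0.5, matching the threshold quoted for 2 FPS inputs.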