Think While Watching: Online Streaming Segment-Level Memory for Multi-Turn Video Reasoning in Multimodal Large Language Models

Lu Wang, Zhuoran Jin, Yupu Hao, Yubo Chen, Kang Liu, Yulong Ao, Jun Zhao

Abstract

Multimodal large language models (MLLMs) have shown strong performance on offline video understanding, but most are limited to offline inference or have weak online reasoning, making multi-turn interaction over continuously arriving video streams difficult. Existing streaming methods typically use an interleaved perception-generation paradigm, which prevents concurrent perception and generation and leads to early memory decay as streams grow, hurting long-range dependency modeling. We propose Think While Watching, a memory-anchored streaming video reasoning framework that preserves continuous segment-level memory during multi-turn interaction. We build a three-stage, multi-round chain-of-thought dataset and adopt a stage-matched training strategy, while enforcing strict causality through a segment-level streaming causal mask and streaming positional encoding. During inference, we introduce an efficient pipeline that overlaps watching and thinking and adaptively selects the best attention backend. Under both single-round and multi-round streaming input protocols, our method achieves strong results. Built on Qwen3-VL, it improves single-round accuracy by 2.6% on StreamingBench and by 3.79% on OVO-Bench. In the multi-round setting, it maintains performance while reducing output tokens by 56%. Code is available at: https://github.com/wl666hhh/Think_While_Watching/

Paper Structure

This paper contains 30 sections, 15 equations, 10 figures, and 12 tables.

Figures (10)

  • Figure 1: Overview of Think While Watching. (a) Interleaved baseline. Video perception and answer generation are executed sequentially, which can cause memory erosion, where early memory is forgotten, and a serialization bottleneck, where generation stalls further input ingestion. (b) Think While Watching (ours). The video frames are processed in segments (SEG 1 to SEG 4) to build a continuous segment-level memory. During streaming, questions are answered online by retrieving implicitly relevant memories while continuing to watch. (c) Latency comparison. A schematic timeline showing that interleaved processing accumulates queueing delay, while our decoupled design parallelizes segment processing and answering to reduce latency.
  • Figure 2: Training components of Think While Watching. (a) Segment-level streaming attention mask and streaming positional encoding: example input stream $\mathbf{R}=\langle S_1,Q_1,S_2,Q_2,S_3,S_4,Q_3\rangle$ with generated outputs $\mathbf{C}=\langle C_1,\ldots,C_7\rangle$. Green indicates the source prefix available at time step $u$, which $C_u$ is allowed to attend to. Red masks all future segments to prevent information leakage. For positional encoding, we use separate position indices for the input and output streams. (b) Three-stage training strategy: single-round CoT for streaming input adaptation, multi-round CoT for multi-turn interaction, and long-range capability training for long-term memory, uncertainty handling, and distractor learning.
  • Figure 3: Answer attention vs. segment distance on TWW$_{\text{multi-turn}}$. After Stage 3, attention mass shifts from near-history to more distant segments.
  • Figure 4: Ablation under frame masking on TWW$_{\text{multi-turn,S3}}$. The Overall curve shows the accuracy over the full set; the remaining curves show the results on individual subsets.
  • Figure A1: Decoder-induced ingestion backlog under interleaved streaming. As utilization $\rho$ increases, interleaved decoding pauses can amplify the catch-up delay and enter a backlog explosion regime, while our decoupled design substantially reduces decoder-induced backlog growth.
  • ...and 5 more figures
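
The segment-level streaming mask described in the Figure 2(a) caption can be illustrated with a small sketch. This is not the authors' implementation: the function names, the token lengths, and the use of NumPy are assumptions made for illustration. It encodes only the rule stated in the caption, namely that output $C_u$ may attend to the source prefix up to item $u$ of the input stream and never to later items. Attention among output tokens is not modeled here. The two independent position counters are likewise one assumed reading of "separate position indices for the input and output streams".

```python
import numpy as np

def streaming_block_mask(num_items):
    """Block-level mask: row u (output C_u) may attend to input items 1..u.

    True means attention is allowed; False means the item lies in the future.
    """
    return np.tril(np.ones((num_items, num_items), dtype=bool))

def expand_to_tokens(block_mask, in_lens, out_lens):
    """Expand a block-level mask to token level.

    in_lens[i]  : number of tokens in input item i (hypothetical lengths).
    out_lens[u] : number of tokens in output C_u (hypothetical lengths).
    """
    rows = np.repeat(block_mask, out_lens, axis=0)
    return np.repeat(rows, in_lens, axis=1)

def separate_positions(in_lens, out_lens):
    """Assumed scheme: each stream counts positions with its own counter."""
    return np.arange(sum(in_lens)), np.arange(sum(out_lens))

# Input stream from the caption: R = <S1, Q1, S2, Q2, S3, S4, Q3>
stream = ["S1", "Q1", "S2", "Q2", "S3", "S4", "Q3"]
in_lens = [4, 2, 4, 2, 4, 4, 2]    # illustrative token counts per input item
out_lens = [1, 3, 1, 3, 1, 1, 3]   # illustrative token counts per output C_u

block = streaming_block_mask(len(stream))
token_mask = expand_to_tokens(block, in_lens, out_lens)
in_pos, out_pos = separate_positions(in_lens, out_lens)

# Each row lists which input items the corresponding output can see.
for u, row in enumerate(block, start=1):
    visible = [name for name, ok in zip(stream, row) if ok]
    print(f"C{u} attends to: {visible}")
```

In this sketch the block mask is lower triangular, so every output sees all earlier segments and questions but none that arrive later, which is the causality constraint the caption describes.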