
StreamReady: Learning What to Answer and When in Long Streaming Videos

Shehreen Azad, Vibhav Vineet, Yogesh Singh Rawat

TL;DR

This work introduces StreamReady, a framework to unify temporal reasoning with on-time answering through a lightweight readiness mechanism that decides if sufficient evidence has been observed before responding, and introduces ProReady-QA, a benchmark with annotated answer evidence windows and proactive multi-turn questions.

Abstract

Streaming video understanding often involves time-sensitive scenarios where models need to answer exactly when the supporting visual evidence appears: answering before the evidence appears reflects speculation, while answering after it has passed reduces real-time utility. To capture this behavior, we introduce a readiness-aware formulation of streaming video understanding with the Answer Readiness Score (ARS), a timing-aware objective with asymmetric early and late penalties. When combined with correctness, ARS defines an effective accuracy that measures not just whether a model is right, but whether it answers at the appropriate moment. Building on this formulation, we introduce StreamReady, a framework to unify temporal reasoning with on-time answering through a lightweight readiness mechanism that decides if sufficient evidence has been observed before responding. To evaluate this capability, we further introduce ProReady-QA, a benchmark with annotated answer evidence windows and proactive multi-turn questions across local and global contexts. StreamReady achieves superior performance on ProReady-QA, and consistently outperforms prior methods across eight additional streaming and offline long-video benchmarks, demonstrating robust and broadly generalizable video understanding capability.
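
To make the idea of a timing-aware score with asymmetric early and late penalties concrete, here is a minimal sketch. It is not the paper's ARS definition, which this page does not reproduce. It assumes an exponential decay outside the annotated evidence window and assumes effective accuracy is the mean of correctness multiplied by the timing score. The function names, the decay form, and the default values of `gamma_early` and `gamma_late` are all hypothetical.

```python
import math

def readiness_score(t_answer, win_start, win_end, gamma_early=2.0, gamma_late=0.5):
    """Hypothetical timing score in (0, 1].

    Returns 1.0 when the answer time falls inside the evidence window
    [win_start, win_end]. Outside the window the score decays
    exponentially (an assumed form), with separate sharpness values
    for early and late answers.
    """
    if t_answer < win_start:
        # Answered before the evidence appeared: early penalty.
        return math.exp(-gamma_early * (win_start - t_answer))
    if t_answer > win_end:
        # Answered after the evidence passed: late penalty.
        return math.exp(-gamma_late * (t_answer - win_end))
    return 1.0

def effective_accuracy(records, **penalties):
    """Mean of correctness * timing score (assumed combination rule).

    Each record is (is_correct, t_answer, win_start, win_end).
    """
    if not records:
        return 0.0
    total = sum(
        float(correct) * readiness_score(t, start, end, **penalties)
        for correct, t, start, end in records
    )
    return total / len(records)
```

With these assumed defaults, an answer given one second early scores lower than one given one second late, which is one way to encode the asymmetry the abstract describes. A correct answer only counts in full when it lands inside the evidence window.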

Paper Structure (25 sections, 11 equations, 12 figures, 14 tables)


Figures (12)

  • Figure 1: Readiness-aware streaming video understanding. Left: In proactive streaming settings, questions can precede their supporting evidence, requiring the model to monitor the evolving video and answer once the relevant cues appear. Right: Under our readiness-aware formulation, effective accuracy jointly reflects answer correctness and timing via the Answer Readiness Score (ARS). Although all models achieve similar raw accuracy on this example, ARS reveals sharp performance drops for early (hallucinatory) or late (delayed) answers. In contrast, StreamReady responds within the evidence window, preserving high effective accuracy by answering at the appropriate moment.
  • Figure 2: Framework Overview. StreamReady encodes streaming videos into a visual memory tree and reasons through short and long-term branches. A learnable <RDY> token, guided by a readiness head, gates the reasoning output until sufficient evidence is observed. Once ready, the long-term representation, enriched with contextual information from past QA pairs, is sent to the LLM for answering, enabling readiness-aware streaming behavior.
  • Figure 3: Examples of each task in ProReady-QA. Here, the question and answer frames are color-coded.
  • Figure 4: Generation pipeline of ProReady-QA.
  • Figure 5: Penalty sharpness for early and late responses. Left: Readiness curves for different penalty strengths. Right: Resulting ARS, with the selected $\gamma_e, \gamma_\ell$ combination highlighted.
  • ...and 7 more figures
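
The gating behavior described in the Figure 2 caption, where output is withheld until sufficient evidence is observed, can be illustrated with a short control-flow sketch. This is only an illustration under stated assumptions: the visual memory tree, the `<RDY>` token, the readiness head, and the LLM are replaced by a plain list and two caller-supplied callables. The names `answer_when_ready`, `readiness_fn`, `answer_fn`, and the `threshold` value are hypothetical and do not come from the paper.

```python
from typing import Any, Callable, Iterable, List, Optional, Tuple

def answer_when_ready(
    frames: Iterable[Any],
    readiness_fn: Callable[[List[Any]], float],
    answer_fn: Callable[[List[Any]], str],
    threshold: float = 0.5,
) -> Optional[Tuple[int, str]]:
    """Consume a stream frame by frame and answer once the gate opens.

    readiness_fn stands in for a readiness head: it maps the memory seen
    so far to a score in [0, 1]. answer_fn stands in for the answering
    model. Returns (frame_index, answer), or None if the gate never opens.
    """
    memory: List[Any] = []  # stand-in for the visual memory (assumption)
    for t, frame in enumerate(frames):
        memory.append(frame)
        # Gate: stay silent until the readiness score reaches the threshold.
        if readiness_fn(memory) >= threshold:
            return t, answer_fn(memory)
    return None
```

For example, with a toy stream `[0, 0, 1, 0]`, a `readiness_fn` that returns 1.0 once the value `1` has been seen, and an `answer_fn` that returns a fixed string, the function responds at frame index 2 rather than at the start or the end of the stream.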