WAT: Online Video Understanding Needs Watching Before Thinking

Zifan Han, Hongbo Sun, Jinglin Xu, Canhui Tang, Yulong Lei, Xuchong Zhang, Hongbin Sun, Zhongjiang He, Hao Sun

Abstract

Multimodal Large Language Models (MLLMs) have shown strong capabilities in image understanding, motivating recent efforts to extend them to video reasoning. However, existing Video LLMs struggle in online streaming scenarios, where long temporal context must be preserved under strict memory constraints. We propose WAT (Watching Before Thinking), a two-stage framework for online video reasoning. WAT separates processing into a query-independent watching stage and a query-triggered thinking stage. The watching stage builds a hierarchical memory system with a Short-Term Memory (STM) that buffers recent frames and a fixed-capacity Long-Term Memory (LTM) that maintains a diverse summary of historical content using a redundancy-aware eviction policy. In the thinking stage, a context-aware retrieval mechanism combines the query with the current STM context to retrieve relevant historical frames from the LTM for cross-temporal reasoning. To support training for online video tasks, we introduce WAT-85K, a dataset containing streaming-style annotations emphasizing real-time perception, backward tracing, and forecasting. Experiments show that WAT achieves state-of-the-art performance on online video benchmarks, including 77.7% accuracy on StreamingBench and 55.2% on OVO-Bench, outperforming existing open-source online Video LLMs while operating at real-time frame rates.
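
The watching stage described above can be illustrated with a minimal sketch. It assumes frame features are plain vectors, that redundancy is measured as the maximum cosine similarity to any other LTM entry, and that frames leaving the STM are the ones offered to the LTM; none of these details are stated in the abstract. The names `HierarchicalMemory`, `watch`, and `protect_recent` are hypothetical.

```python
import math
from collections import deque


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class HierarchicalMemory:
    """Toy STM/LTM pair (illustrative only, not the authors' implementation)."""

    def __init__(self, stm_size, ltm_capacity, protect_recent=1):
        if ltm_capacity < 1 or not 0 <= protect_recent <= ltm_capacity:
            raise ValueError("need ltm_capacity >= 1 and 0 <= protect_recent <= ltm_capacity")
        self.stm = deque()            # FIFO buffer of (timestamp, feature)
        self.stm_size = stm_size
        self.ltm = []                 # fixed-capacity list of (timestamp, feature)
        self.ltm_capacity = ltm_capacity
        self.protect_recent = protect_recent  # newest LTM entries exempt from eviction

    def watch(self, timestamp, feature):
        """Query-independent update: buffer the frame, spill the oldest into the LTM."""
        self.stm.append((timestamp, feature))
        if len(self.stm) > self.stm_size:
            self._insert_ltm(self.stm.popleft())

    def _insert_ltm(self, entry):
        self.ltm.append(entry)
        if len(self.ltm) <= self.ltm_capacity:
            return
        # Assumed redundancy score: highest similarity to any other LTM entry.
        def redundancy(i):
            return max(
                cosine(self.ltm[i][1], self.ltm[j][1])
                for j in range(len(self.ltm))
                if j != i
            )
        # Only entries outside the protected recent window may be evicted.
        candidates = range(len(self.ltm) - self.protect_recent)
        victim = max(candidates, key=redundancy)
        del self.ltm[victim]
```

In this sketch the LTM never exceeds its capacity, and a near-duplicate of a protected recent entry is evicted in preference to a dissimilar older one, which keeps the stored summary diverse.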

Paper Structure (27 sections, 7 equations, 4 figures, 6 tables, 2 algorithms)

Figures (4)

  • Figure 1: Illustration of different Video LLM paradigms: (a) vanilla offline processing, which encodes all uniformly sampled frames at once; (b) vanilla online processing, which encodes frames sequentially into a space-limited memory; and (c) WAT, which introduces a watching-before-thinking pipeline to enhance online comprehension through hierarchical memory and explicit query-based retrieval reasoning.
  • Figure 2: Overview of the proposed WAT framework, which follows a two-stage pipeline for online streaming video reasoning. Given a continuous video stream $V$ and a textual query $Q$, WAT decouples perception and reasoning into a Watching stage and a Thinking stage. In the Watching stage, WAT maintains a hierarchical memory system consisting of a Short-Term Memory (STM, $\mathcal{M}_S$) and a Long-Term Memory (LTM, $\mathcal{M}_L$). The STM serves as a high-fidelity sliding window, implemented as a First-In-First-Out (FIFO) queue to preserve recent frames and capture fine-grained temporal dynamics. In parallel, the LTM stores a compact yet semantically diverse set of historical features, managed by a redundancy-aware eviction policy that removes over-represented entries while protecting recent content. In the Thinking stage, triggered by a query $Q$, WAT performs context-aware retrieval by first fusing the query with the current STM content to form a conditioned query representation, which is then used to retrieve the top-$K$ most relevant historical features $\mathcal{F}^*$ from the LTM. Finally, the retrieved features $\mathcal{F}^*$, together with the short-term memory $\mathcal{M}_S$ and the query $Q$, are concatenated into a unified multimodal input sequence and fed into a Multimodal Large Language Model (MLLM) to perform coherent cross-temporal reasoning and generate the final response. The retrieval process is trained with an auxiliary Retrieval Alignment Contrastive Learning (RACL) objective. This contrastive loss aligns the relevant features $\mathcal{F}^*$ with the query $Q$ (positive) against random visual evidence $\mathcal{F}^*_R$ (negative).
  • Figure 3: A case study of WAT on a sample video from OVO-Bench, demonstrating its online reasoning capabilities. At 59s, when a user poses the query “What does the man do before?”, WAT proactively infers the preceding actions and correctly outputs the past activity, showing its ability to leverage hierarchical memory for temporal context. At 184s, when presented with a follow-up question about the future, “What will the man do next?”, the system likewise produces an accurate prediction, highlighting WAT’s capacity for both backward tracing and forward forecasting in continuous video streams.
  • Figure 4: Visual comparison of retrieved visual content and corresponding responses between WAT and other state-of-the-art (SOTA) approaches. In the top example, WAT exhibits strong long-term temporal reasoning, capturing dependencies across distant frames for online video understanding. This highlights WAT’s advantage in modeling extended temporal contexts, which is crucial for accurately understanding complex, continuous video streams. The bottom example further demonstrates WAT’s robust performance on conventional (short-form) VideoQA tasks, showing precise alignment between relevant visual segments and generated responses.
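
The thinking stage outlined in the Figure 2 caption (fuse the query with the STM, retrieve the top-$K$ LTM features, train retrieval with a contrastive objective) might look roughly like the sketch below. The caption does not give the fusion operator, the similarity measure, or the exact form of the RACL loss, so three assumptions are made here: fusion is a convex combination of the query embedding and the mean STM feature, similarity is cosine, and the loss takes an InfoNCE-style form with temperature `tau`. The function names and the parameters `alpha` and `tau` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def fuse(query, stm_feats, alpha=0.5):
    """Assumed fusion: blend the query embedding with the mean STM feature."""
    dim = len(query)
    mean = [sum(f[d] for f in stm_feats) / len(stm_feats) for d in range(dim)]
    return [alpha * q + (1 - alpha) * m for q, m in zip(query, mean)]


def retrieve_top_k(query, stm_feats, ltm, k, alpha=0.5):
    """Return the k LTM entries most similar to the conditioned query.

    `ltm` is a list of (timestamp, feature) pairs; the result is re-sorted
    by timestamp so the retrieved frames stay in temporal order.
    """
    cond = fuse(query, stm_feats, alpha)
    ranked = sorted(ltm, key=lambda e: cosine(cond, e[1]), reverse=True)[:k]
    return sorted(ranked, key=lambda e: e[0])


def contrastive_loss(query, positives, negatives, tau=0.1):
    """InfoNCE-style stand-in for RACL: retrieved features as positives,
    randomly sampled features as negatives (assumed form)."""
    pos = sum(math.exp(cosine(query, f) / tau) for f in positives)
    neg = sum(math.exp(cosine(query, f) / tau) for f in negatives)
    return -math.log(pos / (pos + neg))


# Usage with toy 2-D features.
ltm = [(0, [1.0, 0.0]), (5, [0.0, 1.0]), (9, [0.9, 0.1]), (12, [-1.0, 0.0])]
query = [1.0, 0.0]
stm_feats = [[1.0, 0.2], [1.0, 0.0]]
retrieved = retrieve_top_k(query, stm_feats, ltm, k=2)
```

Under these assumptions the loss is small when the retrieved features align with the query and large when the random negatives align better, which is the behaviour the caption attributes to the auxiliary objective.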