
WeaveTime: Stream from Earlier Frames into Emergent Memory in VideoLLMs

Yulin Zhang, Cheng Shi, Sibei Yang

TL;DR

WeaveTime is a simple, efficient, and model-agnostic framework that first teaches order and then uses order. It introduces a lightweight Temporal Reconstruction objective (the authors' Streaming Order Perception enhancement) that instills order-aware representations with minimal finetuning and no specialized streaming data.

Abstract

Recent advances in Multimodal Large Language Models have greatly improved visual understanding and reasoning, yet their quadratic attention and offline training protocols make them ill-suited for streaming settings where frames arrive sequentially and future observations are inaccessible. We diagnose a core limitation of current Video-LLMs, namely Time-Agnosticism, in which videos are treated as an unordered bag of evidence rather than a causally ordered sequence. This yields two failures in streams: temporal order ambiguity, in which the model cannot follow or reason over the correct chronological order, and past-current focus blindness, in which it fails to distinguish present observations from accumulated history. We present WeaveTime, a simple, efficient, and model-agnostic framework that first teaches order and then uses order. We introduce a lightweight Temporal Reconstruction objective, our Streaming Order Perception enhancement, that instills order-aware representations with minimal finetuning and no specialized streaming data. At inference, a Past-Current Dynamic Focus Cache performs uncertainty-triggered, coarse-to-fine retrieval, expanding history only when needed. Plugged into existing Video-LLMs without architectural changes, WeaveTime delivers consistent gains on representative streaming benchmarks, improving accuracy while reducing latency. These results establish WeaveTime as a practical path toward time-aware streaming Video-LLMs under strict online, time-causal constraints. Code and weights will be made publicly available. Project Page: https://zhangyl4.github.io/publications/weavetime/
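
As a rough illustration of the "teach order" idea in the abstract, the sketch below builds one training sample for a frame-reordering task: frames are shuffled, and the target records where each chronological frame ended up. This is not the authors' implementation; the function names (`make_reorder_sample`, `restore_order`), the target encoding, and the use of plain Python lists in place of frame features are all assumptions made for illustration.

```python
import random


def make_reorder_sample(frames, seed=None):
    """Shuffle frames and return (shuffled, target).

    target[i] is the position in `shuffled` that holds the i-th frame of the
    true chronological order. (Hypothetical encoding, chosen for illustration.)
    """
    rng = random.Random(seed)
    perm = list(range(len(frames)))
    rng.shuffle(perm)
    # shuffled[k] is the frame that was originally at index perm[k]
    shuffled = [frames[j] for j in perm]
    target = [0] * len(frames)
    for k, j in enumerate(perm):
        target[j] = k
    return shuffled, target


def restore_order(shuffled, target):
    """Invert the shuffle using the target; a perfect prediction recovers the video."""
    return [shuffled[k] for k in target]


frames = ["f0", "f1", "f2", "f3", "f4"]
shuffled, target = make_reorder_sample(frames, seed=0)
assert restore_order(shuffled, target) == frames
```

In a setup like this, a model would be asked to predict `target` from `shuffled`, so the supervision comes from ordinary offline videos rather than from specialized streaming data.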

Paper Structure (17 sections, 9 equations, 7 figures, 5 tables)

This paper contains 17 sections, 9 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: Illustrative examples of two coupled challenges in streaming Video-LLMs stemming from Time-Agnosticism. Top: Temporal Order Ambiguity. The model struggles to correctly interpret the temporal sequence of events (e.g., entering vs. leaving a room), leading to erroneous spatial inferences (e.g., mislocating the orange flowers). Bottom: Past–Current Focus Blindness. The model fails to dynamically allocate attention between the immediate observation and relevant past memories. For the "What color is the flower in the painting now?" query, the answer is in the current frame, but the model recalls an irrelevant past moment. Conversely, for the "Where is the full body mirror?" query, which requires historical context, the model fixates on the current frame, leading to an incorrect answer.
  • Figure 2: Impact of ground-truth answer-window shift on Video-QA accuracy. This heatmap shows model accuracy when the ground-truth (GT) answer window is shifted to different relative positions (0–100%) along the video, evaluated over videos grouped by binned length (e.g., 345 s, 480 s). Systematic changes in accuracy across shift positions and lengths reveal a temporal positional bias, whereby the model prefers specific time locations rather than consistently locating evidence based on the query.
  • Figure 3: Overview of WeaveTime. (Left) The VideoLLM-SOPE enhances temporal perception by reconstructing correct frame order from shuffled inputs during training. (Right) The Past–Current Dynamic Focus (PCDF) Cache adaptively controls memory retrieval at inference, balancing between immediate observations and recalled past content. Together, these components jointly mitigate temporal ambiguity and inefficient memory access, enabling robust and temporally coherent streaming reasoning.
  • Figure 4: Overview of the Past–Current Dynamic Focus Cache (PCDF-Cache). (a) Given streaming inputs, the PCDF-Cache monitors prediction entropy: when uncertainty is low, the model answers directly from the current observation; when high, it triggers a recall from long-term memory via Coarse-to-Fine Recall. (b) The recall module performs hierarchical selection—first pooling coarse candidates, then applying a max-similarity criterion for fine retrieval—balancing retrieval cost and contextual accuracy.
  • Figure 5: Ablation on the entropy threshold in PCDF-Cache. On OvO-Bench (Li et al., 2025), the threshold controls when memory recall is triggered. Accuracy (blue) peaks at 0.6, balancing current observation and past memory, while response latency (green) decreases with larger thresholds, yielding the best accuracy–efficiency trade-off.
  • ...and 2 more figures
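
The entropy gate and Coarse-to-Fine Recall described in the Figure 4 caption can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the paper's code: the names (`normalized_entropy`, `coarse_to_fine_recall`, `select_context`), the use of mean pooling for the coarse stage, dot-product similarity, and normalizing entropy to [0, 1] are all hypothetical choices. The default threshold of 0.6 echoes the peak reported in the Figure 5 caption, but whether the paper measures entropy on this scale is not stated here.

```python
import math


def normalized_entropy(probs):
    """Shannon entropy of a distribution, scaled to [0, 1] (assumed convention)."""
    if len(probs) <= 1:
        return 0.0
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return h / math.log(len(probs))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mean_pool(tokens):
    """Average a frame's token vectors into one coarse vector."""
    return [sum(col) / len(tokens) for col in zip(*tokens)]


def coarse_to_fine_recall(query, memory, k_coarse=3, k_fine=1):
    """memory: list of frames, each a list of token vectors.

    Coarse stage: rank frames by similarity of their pooled vector to the query.
    Fine stage: re-rank the candidates by their best-matching single token.
    Returns the selected frame indices in chronological order.
    """
    coarse = sorted(
        ((dot(query, mean_pool(tokens)), i) for i, tokens in enumerate(memory)),
        reverse=True,
    )
    candidates = [i for _, i in coarse[:k_coarse]]
    fine = sorted(
        ((max(dot(query, t) for t in memory[i]), i) for i in candidates),
        reverse=True,
    )
    return sorted(i for _, i in fine[:k_fine])


def select_context(probs, query, memory, threshold=0.6, k_coarse=3, k_fine=1):
    """Return [] when the prediction is confident, else recalled frame indices."""
    if normalized_entropy(probs) <= threshold:
        return []  # low uncertainty: answer from the current observation only
    return coarse_to_fine_recall(query, memory, k_coarse, k_fine)
```

With a design like this, the recall cost is paid only for uncertain predictions, which is consistent with the caption's observation that latency decreases as the threshold grows.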