Going Down Memory Lane: Scaling Tokens for Video Stream Understanding with Dynamic KV-Cache Memory

Vatsal Agarwal, Saksham Suri, Matthew Gwilliam, Pulkit Kumar, Abhinav Shrivastava

TL;DR

This work introduces an adaptive selection strategy that reduces token redundancy while preserving local spatiotemporal information and proposes a training-free retrieval mixture-of-experts that leverages external models to better identify relevant frames.

Abstract

Streaming video understanding requires models to robustly encode, store, and retrieve information from a continuous video stream to support accurate video question answering (VQA). Existing state-of-the-art approaches rely on key-value caching to accumulate frame-level information over time, but use a limited number of tokens per frame, leading to the loss of fine-grained visual details. In this work, we propose scaling the token budget to enable more granular spatiotemporal understanding and reasoning. First, we find that current methods are ill-equipped to handle dense streams: their feature encoding causes query-frame similarity scores to increase over time, biasing retrieval toward later frames. To address this, we introduce an adaptive selection strategy that reduces token redundancy while preserving local spatiotemporal information. We further propose a training-free retrieval mixture-of-experts that leverages external models to better identify relevant frames. Our method, MemStream, achieves +8.0% on CG-Bench, +8.5% on LVBench, and +2.4% on VideoMME (Long) over ReKV with Qwen2.5-VL-7B.
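
The abstract does not say how the training-free retrieval mixture-of-experts combines its signals, so the following is only a minimal sketch under stated assumptions, not the authors' method. It assumes each expert produces one relevance score per frame, normalizes each expert's scores with a z-score so that differently scaled experts are comparable, averages them, and keeps the top-k frames. The function names, the expert names, and the equal weighting are all hypothetical.

```python
import math


def zscore(scores):
    """Standardize one expert's per-frame scores (assumed normalization choice)."""
    n = len(scores)
    mean = sum(scores) / n
    std = math.sqrt(sum((s - mean) ** 2 for s in scores) / n)
    if std == 0.0:
        # A constant score vector carries no ranking information.
        return [0.0] * n
    return [(s - mean) / std for s in scores]


def fuse_and_select(expert_scores, top_k):
    """Average z-scored expert signals and return the top_k frame indices.

    expert_scores: dict mapping a (hypothetical) expert name to a list of
    per-frame relevance scores; all lists must have the same length.
    Returns the selected frame indices in temporal order.
    """
    lengths = {len(s) for s in expert_scores.values()}
    if len(lengths) != 1:
        raise ValueError("all experts must score the same number of frames")
    num_frames = lengths.pop()

    fused = [0.0] * num_frames
    for scores in expert_scores.values():
        for i, z in enumerate(zscore(scores)):
            fused[i] += z / len(expert_scores)  # equal weights: an assumption

    ranked = sorted(range(num_frames), key=lambda i: fused[i], reverse=True)
    return sorted(ranked[:top_k])


# Two hypothetical experts whose raw scores live on very different scales.
selected = fuse_and_select(
    {"kv_similarity": [0.1, 0.9, 0.2, 0.3], "external_model": [10, 50, 5, 40]},
    top_k=2,
)
print(selected)
```

Normalizing per expert before averaging keeps an expert with large raw values from dominating the fused ranking; other fusion rules, such as rank-based voting, would fit the same interface.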

Paper Structure (32 sections, 9 equations, 13 figures, 11 tables)

Figures (13)

  • Figure 1: (a) We propose constructing the key-value cache with sparse sliding-window attention and design an adaptive key selection (AKS) strategy to sparsify the sliding window. (b) During question-answering, we merge complementary retrieval signals from external models via a training-free mixture-of-experts.
  • Figure 2: Increasing per-frame token budget leads to substantial declines in average layer-wise recall across a variety of different questions.
  • Figure 3: In (a), we identify a systematic trend where query-frame similarity scores progressively increase across the video. In (b), we observe that self-similarity maps of the key representations become more redundant as we increase tokens per frame.
  • Figure 4: We compute the normalized entropy for 677 sliding windows under 2 separate token budgets. At the higher token budget, the sliding window attention tends to exhibit higher entropy. Rather than increased informativeness, this suggests a struggle to focus on relevant frames at higher token budgets.
  • Figure 5: We measure the recall with which each layer, given a question, retrieves the features of the CG-Bench "clue" frames. Recall varies widely across layers but is generally quite low.
  • ...and 8 more figures
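
The captions for Figures 4 and 5 refer to two diagnostics, normalized entropy of sliding-window attention and recall of "clue" frames. The sketch below shows the standard textbook forms of both, assuming entropy is normalized by its maximum (the log of the number of entries) and recall is the fraction of clue frames that appear among the retrieved frames. The paper's exact definitions are not given in this excerpt, and the function names are hypothetical.

```python
import math


def normalized_entropy(weights):
    """Shannon entropy of a non-negative weight vector, scaled to [0, 1].

    1.0 means attention is spread uniformly; 0.0 means it sits on one entry.
    """
    total = sum(weights)
    if total <= 0 or len(weights) < 2:
        return 0.0
    probs = [w / total for w in weights]
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return entropy / math.log(len(weights))


def frame_recall(retrieved_frames, clue_frames):
    """Fraction of annotated clue frames that were retrieved (assumed definition)."""
    clue = set(clue_frames)
    if not clue:
        return 0.0
    return len(clue & set(retrieved_frames)) / len(clue)


uniform = normalized_entropy([1, 1, 1, 1])   # attention spread evenly
peaked = normalized_entropy([1, 0, 0, 0])    # attention on a single entry
recall = frame_recall(retrieved_frames=[3, 7, 12], clue_frames=[7, 12, 20, 21])
print(uniform, peaked, recall)
```

Under this definition, a higher normalized entropy at a larger token budget means attention is more diffuse, which is consistent with the Figure 4 reading that the model struggles to focus on relevant frames.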