Streaming Video Question-Answering with In-context Video KV-Cache Retrieval

Shangzhe Di, Zhelun Yu, Guanghao Zhang, Haoyuan Li, Tao Zhong, Hao Cheng, Bolin Li, Wanggui He, Fangxun Shu, Hao Jiang

TL;DR

This work introduces ReKV, a training-free framework for streaming video question-answering that integrates with existing Video-LLMs. By combining sliding-window attention for video encoding with in-context KV-Cache retrieval and RAM/disk offloading, ReKV enables real-time responses while preserving long-term video context. Extensive offline and streaming evaluations show that ReKV improves accuracy on long-form benchmarks and maintains low latency and memory usage, outperforming several memory-based streaming baselines. The approach also demonstrates robustness across multiple Video-LLMs and benchmarks, underscoring its practicality for real-world streaming scenarios.

Abstract

We propose ReKV, a novel training-free approach that enables efficient streaming video question-answering (StreamingVQA), by seamlessly integrating with existing Video Large Language Models (Video-LLMs). Traditional VideoQA systems struggle with long videos, as they must process entire videos before responding to queries, and repeat this process for each new question. In contrast, our approach analyzes long videos in a streaming manner, allowing for prompt responses as soon as user queries are received. Building on a common Video-LLM, we first incorporate a sliding-window attention mechanism, ensuring that input frames attend to a limited number of preceding frames, thereby reducing computational overhead. To prevent information loss, we store processed video key-value caches (KV-Caches) in RAM and disk, reloading them into GPU memory as needed. Additionally, we introduce a retrieval method that leverages an external retriever or the parameters within Video-LLMs to retrieve only query-relevant KV-Caches, ensuring both efficiency and accuracy in question answering. ReKV enables the separation of video encoding and question-answering across different processes and GPUs, significantly enhancing the efficiency of StreamingVQA. Through comprehensive experimentation, we validate the efficacy and practicality of our approach, which significantly boosts efficiency and enhances applicability over existing VideoQA models.
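
The encoding side of this pipeline can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the authors' implementation: it works at frame granularity rather than on per-layer token KV-Caches, and the names `encode_stream`, `gpu_window`, and `offloaded` are hypothetical stand-ins for GPU-resident and RAM/disk-resident caches.

```python
from collections import deque


def encode_stream(frames, window):
    """Simulate sliding-window encoding over a frame stream.

    Each incoming frame may attend to at most `window` preceding frames
    (plus itself). Frames that fall out of the window are not discarded:
    they are moved to an offload store (a stand-in for RAM or disk).
    """
    gpu_window = deque()  # hypothetical stand-in for in-window KV-Caches
    offloaded = []        # hypothetical stand-in for offloaded KV-Caches
    visible = []          # visible[i] lists the frame indices frame i attends to

    for idx, frame in enumerate(frames):
        # The new frame sees the frames currently in the window and itself.
        visible.append([i for i, _ in gpu_window] + [idx])
        gpu_window.append((idx, frame))
        # Keep at most `window` past frames on the "GPU"; offload the oldest.
        if len(gpu_window) > window:
            offloaded.append(gpu_window.popleft())

    return visible, offloaded, list(gpu_window)


if __name__ == "__main__":
    visible, offloaded, in_window = encode_stream(list("abcdef"), window=2)
    print(visible[-1])   # indices the last frame attends to
    print(offloaded)     # frames moved out of the window, oldest first
```

The point of the sketch is that attention cost per frame is bounded by the window size, while the union of the window and the offload store still covers every frame seen so far, so later retrieval can reach any part of the video.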

Paper Structure

This paper contains 20 sections, 4 equations, 4 figures, 9 tables.

Figures (4)

  • Figure 1: Overview of the StreamingVQA task and our proposed ReKV. (a) StreamingVQA requires a model to continuously process video streams and answer questions about previously viewed content at any moment. (b) We propose ReKV to enhance efficiency and accuracy in StreamingVQA. Tested with LLaVA-OV-7B on an H800 (80GB) GPU, ReKV maintains stable latency and GPU memory usage, preventing out-of-memory (OOM) errors as frames increase. It also improves the accuracy on seven long-form VideoQA benchmarks compared to the uniform sampling baseline. Further details are provided in Section \ref{sec:experiments}.
  • Figure 2: Overview of ReKV. We modify the attention mechanism in Decoder-based Video-LLMs: (a) The video stream is encoded with sliding-window attention (Equation \ref{equ:video_encoding}), with out-of-window Video KV-Caches offloaded to RAM or disk. (b) Upon receiving a question, relevant key-value vectors are retrieved based on cosine similarity, with compressed vectors to accelerate retrieval (Equation \ref{equ:retrieval}). (c) The retrieved key-value vectors are reloaded onto the GPU and utilized for autoregressive answer generation (Equation \ref{equ:answer_generation}).
  • Figure 3: Ablation study of retrieval hyperparameters: (a) number of retrieved frames and (b) number of frames per retrieval block. Experiments are conducted with LLaVA-OV-7B.
  • Figure 4: StreamingVQA qualitative examples. The example is drawn from the QaEgo4D benchmark. The video stream is processed frame by frame. The two $\CIRCLE$ markers mark the timestamps at which questions are posed, and the two $\square$ markers indicate the relevant video contexts that support answering these questions.
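
The retrieval step described in the Figure 2(b) caption, together with the two hyperparameters ablated in Figure 3, can be sketched as follows. This is an illustrative sketch only: representing each block by the mean of its frame vectors is an assumption standing in for the "compressed vectors" mentioned in the caption, and `retrieve_frames`, `frame_keys`, `frames_per_block`, and `num_blocks` are hypothetical names rather than the paper's API.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def retrieve_frames(frame_keys, query, frames_per_block, num_blocks):
    """Return indices of frames in the blocks most similar to the query.

    Frames are grouped into consecutive blocks. Each block is summarized
    by the mean of its frame vectors (an assumed form of compression),
    so similarity is computed once per block instead of once per frame.
    """
    n = len(frame_keys)
    blocks = [
        list(range(start, min(start + frames_per_block, n)))
        for start in range(0, n, frames_per_block)
    ]

    def block_vector(ids):
        dim = len(frame_keys[0])
        return [sum(frame_keys[i][d] for i in ids) / len(ids) for d in range(dim)]

    # Rank blocks by similarity to the question vector, highest first.
    ranked = sorted(
        blocks,
        key=lambda ids: cosine(block_vector(ids), query),
        reverse=True,
    )
    # Return the selected frames in temporal order for answer generation.
    return sorted(i for ids in ranked[:num_blocks] for i in ids)


if __name__ == "__main__":
    keys = [[1, 0], [1, 0.1], [0, 1], [0.1, 1], [-1, 0], [-1, 0.1]]
    print(retrieve_frames(keys, query=[0, 1], frames_per_block=2, num_blocks=1))
```

In this toy setup the number of retrieved frames is `frames_per_block * num_blocks`, which mirrors the trade-off in Figure 3: larger blocks keep temporally adjacent context together, while more blocks widen coverage at the cost of a longer context for answer generation.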