LiveVLM: Efficient Online Video Understanding via Streaming-Oriented KV Cache and Retrieval

Zhenyu Ning, Guangda Liu, Qihao Jin, Wenchao Ding, Minyi Guo, Jieru Zhao

TL;DR

LiveVLM is a training-free framework for streaming, online video understanding that maintains long-term memory through a streaming-oriented KV cache and an online question-answering workflow. It pre-encodes and compresses video KVs into a memory-efficient cache, then answers each question by retrieving query-relevant KVs from long-term memory and combining them with a short-term sliding window for recent detail. Applied to LLaVA-OneVision, it processes $44\times$ as many frames on the same device and responds up to $5\times$ faster than state-of-the-art online methods at 256 input frames, while maintaining or improving accuracy, and it performs strongly on both streaming and offline long-video benchmarks. This makes real-time video understanding more practical for applications such as autonomous systems and streaming services by balancing accuracy, memory overhead, and speed.

Abstract

Recent developments in Video Large Language Models (Video LLMs) have enabled models to process long video sequences and demonstrate remarkable performance. Nonetheless, studies predominantly focus on offline video question answering, neglecting memory usage and response speed, which are essential in various real-world applications such as Deepseek services, autonomous driving, and robotics. To address these challenges, we propose $\textbf{LiveVLM}$, a training-free framework specifically designed for streaming, online video understanding and real-time interaction. Unlike existing works that process videos only after a question is posed, LiveVLM constructs a streaming-oriented KV cache that processes video streams in real time, retains long-term video details, and eliminates redundant KVs, ensuring prompt responses to user queries. For continuous video streams, LiveVLM generates and compresses video key-value tensors (video KVs) to preserve visual information while improving memory efficiency. Furthermore, when a new question is posed, LiveVLM runs an online question-answering process that efficiently fetches both short-term and long-term visual information while minimizing interference from redundant context. Extensive experiments demonstrate that LiveVLM enables the foundation LLaVA-OneVision model to process $44\times$ as many frames on the same device and to respond up to $5\times$ faster than SoTA online methods at an input of 256 frames, while maintaining the same or better model performance.
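The workflow described in the abstract (compress KVs as the stream arrives, then fetch long-term and short-term information at question time) can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the class name `StreamingMemory`, the norm-based compression rule, the mean-key chunk summary, and the cosine-similarity ranking are hypothetical stand-ins, not LiveVLM's actual compression or retrieval method.

```python
import math
from collections import deque


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class StreamingMemory:
    """Toy stand-in for a streaming KV cache (hypothetical, not the paper's design)."""

    def __init__(self, keep_per_chunk=2, window=2):
        self.keep_per_chunk = keep_per_chunk     # entries kept per chunk after compression
        self.long_term = []                      # (chunk_id, summary_key, kept_entries)
        self.short_term = deque(maxlen=window)   # most recent chunks, uncompressed

    def add_chunk(self, chunk_id, entries):
        """Store one non-empty chunk given as a list of (key_vector, value) pairs."""
        # Assumed compression rule: keep the entries whose keys have the largest norm.
        ranked = sorted(entries, key=lambda kv: sum(x * x for x in kv[0]), reverse=True)
        kept = ranked[: self.keep_per_chunk]
        dim = len(kept[0][0])
        # Assumed chunk summary: the mean of the kept keys.
        summary = [sum(k[i] for k, _ in kept) / len(kept) for i in range(dim)]
        self.long_term.append((chunk_id, summary, kept))
        self.short_term.append((chunk_id, entries))

    def retrieve(self, query, top_k=1):
        """Return (ids of the top_k most similar older chunks, ids in the recent window)."""
        recent_ids = [cid for cid, _ in self.short_term]
        candidates = [c for c in self.long_term if c[0] not in recent_ids]
        ranked = sorted(candidates, key=lambda c: cosine(query, c[1]), reverse=True)
        long_ids = sorted(cid for cid, _, _ in ranked[:top_k])
        return long_ids, recent_ids
```

In this sketch the expensive work (compression and summarization) happens while the stream is ingested, so answering a question only requires ranking chunk summaries. A real system would operate on per-layer key and value tensors rather than small Python lists.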

Paper Structure

This paper contains 27 sections, 2 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: (a) Comparison between SoTA online video LLMs and our LiveVLM. A typical online video LLM feeds all visual features and text tokens into the backbone LLM, resulting in quadratic computational complexity that limits the length of video content it can process. In contrast, LiveVLM continuously generates and compresses KVs for video streams during online processing, and selects query-relevant KVs when a new question comes, improving response speed due to pre-computation and KV selection, while retaining long-term information without exceeding memory capacity. (b) Comparison between ReKV and LiveVLM. LiveVLM achieves lower memory usage and faster response speed as the video length increases, while maintaining the same model performance.
  • Figure 2: The overall workflow of LiveVLM. For input video streams, LiveVLM constructs a streaming-oriented KV cache to continuously process and store video frames in the form of compressed video KVs. Upon the user submitting a new question, LiveVLM gathers short-term and long-term information for online question-answering and generates timely responses.
  • Figure 3: Illustration of online question-answering.
  • Figure 4: Comparison between LiveVLM and SoTA online methods on efficiency.
  • Figure 5: Ablation study of hyperparameters for online KV retrieval. We evaluate LiveVLM on the EgoSchema, MLVU and VideoMME benchmarks with varying numbers of retrieved chunks.
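
The quadratic-cost argument in the Figure 1 caption can be made concrete with a back-of-the-envelope count of query-key pairs in causal self-attention. This is a rough sketch under stated assumptions: the function names are hypothetical, the count covers a single attention layer and head at question time only, and the cost of pre-computing KVs while the stream is ingested is deliberately left out.

```python
def attention_pairs_full(num_frames, tokens_per_frame, question_tokens):
    """Query-key pairs when all video and text tokens go through causal attention together."""
    n = num_frames * tokens_per_frame + question_tokens
    # Token i attends to itself and every earlier token: 1 + 2 + ... + n pairs.
    return n * (n + 1) // 2


def attention_pairs_selected(selected_kv_tokens, question_tokens):
    """Query-key pairs when only question tokens attend to pre-computed, selected KVs."""
    q = question_tokens
    # Each question token sees every selected KV, plus causal attention among the question tokens.
    return q * selected_kv_tokens + q * (q + 1) // 2
```

Under this count, doubling the number of frames roughly quadruples the cost of the full pass, whereas the cost of the selected pass grows only linearly with the number of KVs retrieved, which is the quantity varied in the Figure 5 ablation.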