LiveVLM: Efficient Online Video Understanding via Streaming-Oriented KV Cache and Retrieval
Zhenyu Ning, Guangda Liu, Qihao Jin, Wenchao Ding, Minyi Guo, Jieru Zhao
TL;DR
LiveVLM is a training-free framework for streaming, online video understanding that maintains long-term memory through a streaming-oriented KV cache and an online question-answering workflow. As video arrives, it encodes and compresses video KVs into a memory-efficient cache; when a question is posed, it retrieves query-relevant KVs from long-term memory and uses a short-term sliding window for recent detail. With LLaVA-OneVision as the foundation model, LiveVLM processes $44\times$ as many frames on the same device and responds up to $5\times$ faster than state-of-the-art online methods at an input of 256 frames, while maintaining or improving accuracy. It also performs strongly on both streaming and offline long-video benchmarks. By balancing accuracy, memory overhead, and speed, this work supports scalable, real-time video understanding for applications such as autonomous systems and streaming services.
Abstract
Recent developments in Video Large Language Models (Video LLMs) have enabled models to process long video sequences and demonstrate remarkable performance. Nonetheless, existing studies predominantly focus on offline video question answering and neglect memory usage and response speed, both of which are essential in real-world applications such as Deepseek services, autonomous driving, and robotics. To address these challenges, we propose $\textbf{LiveVLM}$, a training-free framework specifically designed for streaming, online video understanding and real-time interaction. Unlike existing works that begin processing a video only after a question is posed, LiveVLM constructs a streaming-oriented KV cache that processes video streams in real time, retains long-term video details, and eliminates redundant KVs, ensuring prompt responses to user queries. For continuous video streams, LiveVLM generates and compresses video key-value tensors (video KVs) to preserve visual information while improving memory efficiency. Furthermore, when a new question is posed, LiveVLM runs an online question-answering process that efficiently fetches both short-term and long-term visual information while minimizing interference from redundant context. Extensive experiments demonstrate that LiveVLM enables the foundation LLaVA-OneVision model to process 44$\times$ as many frames on the same device and achieves up to a 5$\times$ speedup in response time over SoTA online methods at an input of 256 frames, while maintaining the same or better model performance.
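The workflow described in the abstract (compress video KVs as the stream arrives, then fetch long-term and short-term information when a question is posed) can be illustrated with a toy sketch. This is not the authors' implementation: the class `ToyStreamingCache`, the norm-based rule for which tokens to keep, the mean-vector frame summary, and the dot-product retrieval score are all hypothetical stand-ins chosen for illustration, and plain Python lists stand in for real key-value tensors.

```python
from collections import deque


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


class ToyStreamingCache:
    """Illustrative only: compressed long-term memory plus a short-term window."""

    def __init__(self, window=2, keep_per_frame=2):
        self.window = window                  # frames kept uncompressed
        self.keep_per_frame = keep_per_frame  # tokens kept per old frame
        self.short_term = deque()             # (frame_id, tokens)
        self.long_term = []                   # (frame_id, summary, kept_tokens)

    def add_frame(self, frame_id, tokens):
        """Ingest one frame; compress the oldest frame once the window is full."""
        self.short_term.append((frame_id, tokens))
        if len(self.short_term) > self.window:
            old_id, old_tokens = self.short_term.popleft()
            # ASSUMPTION: keep the highest-norm tokens as a crude proxy for
            # "important" KVs; the paper's actual criterion is not given here.
            kept = sorted(old_tokens, key=lambda t: dot(t, t), reverse=True)
            kept = kept[: self.keep_per_frame]
            self.long_term.append((old_id, mean_vector(kept), kept))

    def retrieve(self, query, top_k=1):
        """Return top_k query-relevant long-term frames plus the recent window."""
        # ASSUMPTION: relevance is a dot product against the frame summary.
        ranked = sorted(self.long_term, key=lambda e: dot(query, e[1]), reverse=True)
        chosen = sorted(ranked[:top_k], key=lambda e: e[0])  # temporal order
        context = [(fid, kept) for fid, _, kept in chosen]
        context.extend(self.short_term)
        return context


cache = ToyStreamingCache(window=2, keep_per_frame=1)
cache.add_frame(0, [[1.0, 0.0], [0.1, 0.0]])
cache.add_frame(1, [[0.0, 1.0], [0.0, 0.1]])
cache.add_frame(2, [[0.5, 0.5]])
cache.add_frame(3, [[0.2, 0.2]])
context = cache.retrieve([0.0, 1.0], top_k=1)
frame_ids = [fid for fid, _ in context]
```

In this toy run, frames 0 and 1 have aged out of the two-frame window and been compressed to one token each. The query aligns with frame 1, so the returned context holds frame 1 from long-term memory followed by frames 2 and 3 from the window. The point of the sketch is only the division of labor: compression happens at ingest time, so answering a question requires a lookup rather than re-encoding the video.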
