Memory-efficient Streaming VideoLLMs for Real-time Procedural Video Understanding
Dibyadip Chatterjee, Edoardo Remelli, Yale Song, Bugra Tekin, Abhay Mittal, Bharat Bhatnagar, Necati Cihan Camgöz, Shreyas Hampali, Eric Sauser, Shugao Ma, Angela Yao, Fadime Sener
TL;DR
ProVideLLM tackles real-time procedural video understanding with a memory-efficient streaming VideoLLM that interleaves verbalized long-term history with short-term visual context in a single multimodal cache. A hand-focused DETR-QFormer connector strengthens fine-grained hand-object reasoning, while online verbalization cuts the token count needed to represent one hour of long-term observations by 22x over existing methods, so memory and compute scale sub-linearly with video length. The approach achieves state-of-the-art results on six tasks across four datasets and supports per-frame streaming inference at 10 FPS and streaming dialogue at 25 FPS with a 2GB GPU memory footprint. This makes it a practical fit for real-time procedural guidance and AR-assisted tasks, with recognition, anticipation, and planning handled by a single model.
Abstract
We introduce ProVideLLM, an end-to-end framework for real-time procedural video understanding. ProVideLLM integrates a multimodal cache configured to store two types of tokens: verbalized text tokens, which provide compressed textual summaries of long-term observations, and visual tokens, encoded with DETR-QFormer to capture fine-grained details from short-term observations. This design reduces the token count by 22x over existing methods when representing one hour of long-term observations, while still encoding the present at fine granularity. By interleaving these tokens in our multimodal cache, ProVideLLM ensures sub-linear scaling of memory and compute with video length, enabling per-frame streaming inference at 10 FPS and streaming dialogue at 25 FPS with a GPU memory footprint of only 2GB. ProVideLLM also sets new state-of-the-art results on six procedural tasks across four datasets.
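As an illustration of the cache idea described in the abstract, the following is a minimal Python sketch, not the authors' implementation. It keeps a short window of recent frames as visual tokens and replaces older frames with a short text summary. The names `StreamingCache`, `window`, `tokens_per_frame`, and `toy_verbalize` are hypothetical, and the numbers are illustrative rather than taken from the paper. The sketch only shows why the cache stays far smaller than one that stores every frame's visual tokens; it does not reproduce the reported 22x reduction or the sub-linear scaling.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Deque, List


def toy_verbalize(frames: List[int]) -> List[str]:
    # Hypothetical stand-in for a verbalizer: summarizes a segment in two text tokens.
    return ["step", f"{frames[0]}-{frames[-1]}"]


@dataclass
class StreamingCache:
    """Toy cache: long-term history as text tokens, short-term window as visual tokens."""
    window: int                                   # recent frames kept as visual tokens (assumed)
    tokens_per_frame: int                         # visual tokens per frame (assumed)
    verbalize: Callable[[List[int]], List[str]]   # maps a segment of frames to text tokens
    text_tokens: List[str] = field(default_factory=list)
    visual_frames: Deque[int] = field(default_factory=deque)

    def add_frame(self, frame_id: int) -> None:
        self.visual_frames.append(frame_id)
        # When the window overflows, verbalize the oldest `window` frames
        # and drop their visual tokens from the cache.
        if len(self.visual_frames) > self.window:
            evicted = [self.visual_frames.popleft() for _ in range(self.window)]
            self.text_tokens.extend(self.verbalize(evicted))

    def size(self) -> int:
        # Total tokens currently held: text summaries plus short-term visual tokens.
        return len(self.text_tokens) + len(self.visual_frames) * self.tokens_per_frame


cache = StreamingCache(window=4, tokens_per_frame=16, verbalize=toy_verbalize)
num_frames = 20
for frame_id in range(num_frames):
    cache.add_frame(frame_id)

naive_size = num_frames * cache.tokens_per_frame  # cache that keeps every visual token
print(f"interleaved cache: {cache.size()} tokens, visual-only cache: {naive_size} tokens")
```

In this toy setting the interleaved cache holds 72 tokens after 20 frames, against 320 for a cache that retains all visual tokens. The gap widens with video length because each verbalized segment costs 2 tokens instead of 64.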
