Memory-efficient Streaming VideoLLMs for Real-time Procedural Video Understanding

Dibyadip Chatterjee, Edoardo Remelli, Yale Song, Bugra Tekin, Abhay Mittal, Bharat Bhatnagar, Necati Cihan Camgöz, Shreyas Hampali, Eric Sauser, Shugao Ma, Angela Yao, Fadime Sener

TL;DR

ProVideLLM tackles real-time procedural video understanding by introducing a memory-efficient streaming LLM that interleaves verbalized long-term history with short-term visual context in a single multimodal cache. A hand-focused DETR-QFormer connector strengthens fine-grained hand-object reasoning, while online verbalization cuts the token count needed to represent one hour of long-term observations by 22x relative to existing methods, enabling sub-linear scaling of memory and compute with video length. The approach achieves state-of-the-art results on six tasks across four datasets and supports real-time per-frame inference at 10 FPS and streaming dialogue at 25 FPS, with a 2GB GPU memory footprint. These properties make it practical for real-time procedural guidance and AR-assisted tasks, while a single model handles recognition, anticipation, and planning.

Abstract

We introduce ProVideLLM, an end-to-end framework for real-time procedural video understanding. ProVideLLM integrates a multimodal cache configured to store two types of tokens - verbalized text tokens, which provide compressed textual summaries of long-term observations, and visual tokens, encoded with DETR-QFormer to capture fine-grained details from short-term observations. This design reduces token count by 22x over existing methods in representing one hour of long-term observations while effectively encoding fine-granularity of the present. By interleaving these tokens in our multimodal cache, ProVideLLM ensures sub-linear scaling of memory and compute with video length, enabling per-frame streaming inference at 10 FPS and streaming dialogue at 25 FPS, with a minimal 2GB GPU memory footprint. ProVideLLM also sets new state-of-the-art results on six procedural tasks across four datasets.
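
The token-count argument in the abstract can be illustrated with a toy sketch. It is not the authors' implementation: it only counts tokens, and the class name `ToyInterleavedCache` and all of its parameters (window length, tokens per frame, tokens per summary) are hypothetical values chosen for illustration. Older frames are collapsed into a short text summary while only the most recent frames stay as visual tokens.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ToyInterleavedCache:
    """Token-count model of a cache mixing text summaries and recent frames.

    All sizes below are hypothetical; they are not taken from the paper.
    """
    short_term_frames: int = 4   # frames kept as visual tokens
    tokens_per_frame: int = 32   # visual tokens per frame
    tokens_per_summary: int = 8  # text tokens per verbalized segment
    text_tokens: int = 0         # running count of long-term text tokens
    visual: deque = field(default_factory=deque)

    def add_frame(self, frame_id: int) -> None:
        """Add a frame; verbalize the full window once it overflows."""
        self.visual.append(frame_id)
        if len(self.visual) > self.short_term_frames:
            newest = self.visual.pop()
            # Stand-in for verbalization: the old window becomes a summary.
            self.visual.clear()
            self.text_tokens += self.tokens_per_summary
            self.visual.append(newest)

    def total_tokens(self) -> int:
        """Tokens currently held: text summaries plus short-term visuals."""
        return self.text_tokens + len(self.visual) * self.tokens_per_frame


def all_visual_tokens(num_frames: int, tokens_per_frame: int = 32) -> int:
    """Baseline: every frame is kept as visual tokens."""
    return num_frames * tokens_per_frame
```

With these default values, streaming nine frames leaves 48 tokens in the toy cache (two summaries plus one frame), against 288 tokens if every frame were kept as visual tokens. The gap widens as the stream gets longer, which is the intuition behind replacing old visual tokens with text.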

Paper Structure

This paper contains 19 sections, 4 equations, 5 figures, 9 tables, 2 algorithms.

Figures (5)

  • Figure 1: Overview of ProVideLLM. A streaming video large language model for real-time procedural video tasks with a low memory footprint. It features a multimodal interleaved cache composed of textual descriptions of long-term observations spanning several minutes and visual tokens representing short-term observations spanning a few seconds. We also introduce a new DETR-QFormer connector for better fine-grained tokenization of short-term observations. ProVideLLM is capable of handling multiple procedural tasks within a single model.
  • Figure 2: Converting short-term tokens to long-term under various caching strategies. (a) Factorizing the streaming observations into long-term and short-term and ordering them progressively. (a.1) Conversion is done at $O(1)$ cost but cannot represent long-form videos due to the massive token count and memory needs. (a.2) Online verbalization can represent long-form videos within a manageable memory budget but is not suitable for streaming due to its $O(N_S^2+N_L)$ conversion cost. (b) Interleaving them reduces the conversion cost to $O(N)$, allowing streaming inference on long-form videos.
  • Figure 3: (a) Per-class mean temporal variance on EgoExo4D & Ego4D Goal-Step. Classes with high temporal variance are on the left and low-variance classes on the right. (b) Visualization of the top-3 PCA components of patch tokens computed across video frames, thresholded by the first component. For DINOv2 [oquab2023dinov2], the primary factors of variation are the hands, whereas for the language-aligned encoders CLIP [radford2021learning] and SigLIP [zhai2023sigmoid] they are unfocused and scattered across the image.
  • Figure 4: DETR-QFormer architecture & Stage-1 pre-training. Trained to detect hands and objects-in-contact from pseudo-ground-truth bounding boxes generated by [shan2020understanding], DETR-QFormer queries DINOv2 patch tokens to extract hand-object-focused activations.
  • Figure 5: Scaling of memory, runtime and context length under various caching strategies. Results reported on a subset of Ego4D-GoalStep online action detection validation set for a memory budget of 8GB on a single A6000 GPU.