VideoLLaMB: Long Streaming Video Understanding with Recurrent Memory Bridges

Yuxuan Wang, Yiqi Song, Cihang Xie, Yang Liu, Zilong Zheng

TL;DR

VideoLLaMB tackles the practical challenge of long-streaming video understanding by introducing memory-augmented Bridge layers that carry recurrent memory tokens across semantically coherent SceneTiling segments. A memory cache with retrieval preserves past states, enabling efficient long-context encoding with linear GPU memory scaling and a single 8- to 16-frame training regime. The approach achieves state-of-the-art results on multiple long-video benchmarks (e.g., VideoQA, EgoPlan) and demonstrates robustness to substantial length scaling, while enabling training-free streaming captioning. Collectively, VideoLLaMB delivers a scalable, accurate, and cost-effective framework for academic-grade long-video understanding and planning tasks.

Abstract

Recent advancements in large-scale video-language models have shown significant potential for real-time planning and detailed interactions. However, their high computational demands and the scarcity of annotated datasets limit their practicality for academic researchers. In this work, we introduce VideoLLaMB, a novel and efficient framework for long video understanding that leverages recurrent memory bridges and temporal memory tokens to enable seamless encoding of entire video sequences with preserved semantic continuity. Central to our approach is a SceneTiling algorithm that segments videos into coherent semantic units, facilitating robust understanding across tasks without requiring additional training. VideoLLaMB achieves state-of-the-art performance, surpassing existing models by 4.2 points on four VideoQA benchmarks and by 2.06 points on egocentric planning tasks. Notably, it maintains strong performance under extreme video length scaling (up to 8 times) and excels at fine-grained frame retrieval on our proposed Needle in a Video Haystack (NIAVH) benchmark. With linear GPU memory scaling, VideoLLaMB processes up to 320 frames using a single Nvidia A100 GPU despite being trained on only 16 frames, offering an unprecedented balance of accuracy, scalability, and cost-effectiveness. This makes it highly accessible and practical for the academic community.

Paper Structure (42 sections, 7 figures, 11 tables)

This paper contains 42 sections, 7 figures, and 11 tables.

Figures (7)

  • Figure 1: Overview of VideoLLaMB. We first extract the video features using an off-the-shelf vision encoder, then apply SceneTiling to segment the video into semantic segments ($\S$\ref{sec:segment}). Next, we use recurrent memory on these semantic segments to store video information within memory tokens ($\S$\ref{sec:recurrent}). We further employ a retrieval mechanism to update the memory tokens and address long-dependency issues ($\S$\ref{sec:retrieval}). Finally, we project the memory-token-augmented features from the current video segment into the LLM.
  • Figure 2: Length extrapolation results on EgoSchema dataset.
  • Figure 3: Comparison of VideoLLaMB with two long video understanding models on Needle In A Video Haystack (NIAVH). Currently, we set the context length to 320 seconds, in line with what existing models can handle, and set the frame rate to 1 fps to ensure the input contains the needle. The X-axis indicates the video length, and the Y-axis indicates the depth of the insertion point.
  • Figure 4: GPU memory cost. We run all experiments on a single NVIDIA A800 GPU.
  • Figure 5: Qualitative results on EgoPlan.
  • ...and 2 more figures
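
The pipeline in the Figure 1 caption (segment the video, carry memory across segments, retrieve from cached memory) can be illustrated with a toy sketch. This is not the authors' implementation. Everything in it is an assumption made for illustration: a fixed cosine-similarity threshold stands in for SceneTiling, a single running-average vector stands in for the learned memory tokens, and a nearest-neighbour lookup stands in for the retrieval mechanism. The function names (`segment_by_similarity`, `run_memory_bridge`, `retrieve`) and the `threshold` and `decay` parameters are hypothetical.

```python
import math
from typing import List, Sequence

Vector = List[float]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def segment_by_similarity(frames: Sequence[Vector], threshold: float = 0.5) -> List[List[int]]:
    """Assumed stand-in for SceneTiling: start a new segment whenever the
    similarity between adjacent frame features drops below `threshold`."""
    if not frames:
        return []
    segments = [[0]]
    for i in range(1, len(frames)):
        if cosine(frames[i - 1], frames[i]) < threshold:
            segments.append([i])      # similarity dropped: open a new segment
        else:
            segments[-1].append(i)    # same scene: extend the current segment
    return segments


def run_memory_bridge(frames: Sequence[Vector], segments: List[List[int]],
                      decay: float = 0.5) -> List[Vector]:
    """Assumed stand-in for recurrent memory: blend each segment's mean feature
    into one running memory vector and cache the state after every segment."""
    memory = [0.0] * len(frames[0])
    cache: List[Vector] = []
    for seg in segments:
        columns = zip(*(frames[i] for i in seg))
        summary = [sum(col) / len(seg) for col in columns]
        memory = [decay * m + (1 - decay) * s for m, s in zip(memory, summary)]
        cache.append(memory)
    return cache


def retrieve(cache: List[Vector], query: Sequence[float]) -> int:
    """Assumed stand-in for retrieval: index of the cached state closest to `query`."""
    return max(range(len(cache)), key=lambda k: cosine(cache[k], query))


# Four toy 2-D "frame features": two frames of one scene, then two of another.
frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
segments = segment_by_similarity(frames)   # segments == [[0, 1], [2, 3]]
cache = run_memory_bridge(frames, segments)
first_scene = retrieve(cache, [1.0, 0.0])  # cached state closest to the first scene
```

In this sketch the cache holds one state per segment, so a query about early content can still reach the state recorded before later segments overwrote the running memory. That is the intuition behind pairing recurrent memory with retrieval, though the real model operates on learned memory tokens rather than averaged vectors.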