Long-VMNet: Accelerating Long-Form Video Understanding via Fixed Memory

Saket Gurukar, Asim Kadav

TL;DR

This work tackles the inefficiency of long-form video understanding by introducing Long-VMNet, a memory-augmented architecture that stores a fixed-size set of discriminative video tokens in a per-video memory. A differentiable neural sampler populates this memory in a single pass over the video. Queries are then answered from memory by an encoder–decoder that contextualizes the stored tokens against Relational Space-Time (ReST) queries over activity, object, and time. The method achieves 18x–75x inference speedups on ReST-ADL while maintaining competitive accuracy, and it adds an online continual learning loss to mitigate sampling bias. By combining a fixed-memory representation with selective token sampling, Long-VMNet targets edge-deployable, real-time querying and retrieval over videos that span tens of minutes to hours.

Abstract

Long-form video understanding is essential for applications such as video retrieval, summarization, and question answering. Yet traditional approaches demand substantial computing power and are often bottlenecked by GPU memory. To tackle this issue, we present the Long-Video Memory Network (Long-VMNet), a novel video understanding method that employs a fixed-size memory representation to store discriminative patches sampled from the input video. Long-VMNet achieves improved efficiency by leveraging a neural sampler that identifies discriminative tokens. Additionally, Long-VMNet needs only one scan through the video, greatly boosting efficiency. Our results on the ReST-ADL dataset demonstrate an 18x–75x improvement in inference time for long-form video retrieval and question answering, with competitive predictive performance.
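The fixed-memory, single-scan idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the paper describes a learned, differentiable neural sampler, whereas the code below swaps in a plain top-k selection driven by a hand-written scoring function. The names `FixedMemory`, `fill_memory`, and `toy_score` are hypothetical and exist only for illustration. What the sketch does show is the property the abstract relies on: memory stays bounded no matter how long the video is, and every token is visited exactly once.

```python
import heapq
from typing import Callable, Iterable, List, Sequence, Tuple

Token = Sequence[float]


class FixedMemory:
    """Holds at most `capacity` tokens, keeping the highest-scoring ones.

    Illustrative assumption: a simple top-k rule stands in for the
    paper's learned neural sampler.
    """

    def __init__(self, capacity: int) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        # Min-heap of (score, arrival_index, token); the root is the weakest entry.
        self._heap: List[Tuple[float, int, Token]] = []
        self._seen = 0

    def offer(self, token: Token, score: float) -> None:
        """Consider one token; evict the weakest stored token if it scores lower."""
        entry = (score, self._seen, token)
        self._seen += 1
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, entry)
        elif score > self._heap[0][0]:
            heapq.heapreplace(self._heap, entry)

    def tokens(self) -> List[Token]:
        """Return stored tokens in arrival (temporal) order."""
        return [tok for _, _, tok in sorted(self._heap, key=lambda e: e[1])]


def fill_memory(
    clips: Iterable[Iterable[Token]],
    score_fn: Callable[[Token], float],
    capacity: int,
) -> FixedMemory:
    """Single pass over the video: each clip's tokens are scored once."""
    memory = FixedMemory(capacity)
    for clip in clips:
        for token in clip:
            memory.offer(token, score_fn(token))
    return memory


def toy_score(token: Token) -> float:
    # Hypothetical stand-in for a learned discriminativeness score:
    # the squared L2 norm of the token vector.
    return sum(x * x for x in token)


if __name__ == "__main__":
    video = [
        [(0.1, 0.0), (2.0, 1.0)],  # clip 1
        [(0.5, 0.5), (3.0, 0.0)],  # clip 2
        [(0.0, 0.2)],              # clip 3
    ]
    memory = fill_memory(video, toy_score, capacity=2)
    print(memory.tokens())
```

Once the memory is filled, any number of queries can be answered against the stored tokens without reloading video frames. That amortization is the intuition behind the reported inference speedups, although the actual gains depend on the learned sampler and the query model described in the paper.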

Paper Structure

This paper contains 13 sections, 1 equation, 4 figures, 4 tables.

Figures (4)

  • Figure 1: An activity query video and a histogram of the number of times a single frame is reloaded into GPU memory for a video from the ReST-ADL video understanding dataset (FPS=1).
  • Figure 2: Overview of Long-VMNet training: we sample tokens from input videos and store them in memory to efficiently process long-form videos. The inference steps are shown in Figure 3.
  • Figure 3: Two-stage inference pipeline of Long-VMNet.
  • Figure 4: Auxiliary online continual learning loss (shown in red).