$\infty$-Video: A Training-Free Approach to Long Video Understanding via Continuous-Time Memory Consolidation

Saul Santos, António Farinhas, Daniel C. McNamee, André F. T. Martins

TL;DR

The paper tackles the bottleneck of long-video understanding by extending short-context video LMs with a continuous-time long-term memory (LTM) via continuous attention. It introduces $\infty$-Video, which fuses a discrete short-term memory (STM) with a global, unbounded LTM through a Gibbs-density attention mechanism over time, allowing training-free processing of arbitrarily long videos in a single pass. Key innovations include sticky memories that adaptively allocate memory density to salient regions and a principled memory consolidation strategy that contracts past information while incorporating new frames. Empirically, the approach improves performance on video QA benchmarks across Video-LLaMA and VideoChat2 while maintaining efficiency, demonstrating scalable long-context video understanding without additional training.

Abstract

Current video-language models struggle with long-video understanding due to limited context lengths and reliance on sparse frame subsampling, often leading to information loss. This paper introduces $\infty$-Video, which can process arbitrarily long videos through a continuous-time long-term memory (LTM) consolidation mechanism. Our framework augments video Q-formers by allowing them to process unbounded video contexts efficiently and without requiring additional training. Through continuous attention, our approach dynamically allocates higher granularity to the most relevant video segments, forming "sticky" memories that evolve over time. Experiments with Video-LLaMA and VideoChat2 demonstrate improved performance in video question-answering tasks, showcasing the potential of continuous-time LTM mechanisms to enable scalable and training-free comprehension of long videos.
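
The continuous-attention readout described above can be illustrated with a small numerical sketch. It is not the authors' implementation: the one-dimensional toy signal, the quadratic score function, the grid resolution, and the function names `gibbs_density` and `continuous_attention` are all assumptions chosen for illustration. The idea is that attention becomes a probability density over time, here a Gibbs density $p(t) \propto \exp(f(t))$, and the memory readout is the expectation of the signal under that density.

```python
import math

def gibbs_density(scores, grid):
    """Turn scores f(t) on a uniform grid into a density p(t) proportional to exp(f(t))."""
    dt = grid[1] - grid[0]
    m = max(scores)                              # subtract the max for numerical stability
    unnorm = [math.exp(s - m) for s in scores]
    z = sum(unnorm) * dt                         # Riemann-sum estimate of the normalizer
    return [u / z for u in unnorm]

def continuous_attention(signal, density, grid):
    """Approximate the expectation of the signal under the density with a Riemann sum."""
    dt = grid[1] - grid[0]
    return sum(x * p for x, p in zip(signal, density)) * dt

# Hypothetical setup: midpoint grid on [0, 1], a toy 1-D "memory" signal,
# and a score function peaked near t = 0.25 (all assumed, not from the paper).
n = 1000
grid = [(i + 0.5) / n for i in range(n)]
signal = [math.sin(2 * math.pi * t) for t in grid]
scores = [-((t - 0.25) ** 2) / (2 * 0.05 ** 2) for t in grid]

density = gibbs_density(scores, grid)
context = continuous_attention(signal, density, grid)
print(round(context, 3))
```

Because the density concentrates near $t = 0.25$, where the toy signal peaks, the readout is close to the signal's maximum; a flatter score function would pull it toward the signal's average over the whole interval.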

Paper Structure

This paper contains 31 sections, 16 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: (Left) Overview of $\infty$-Video (our approach) applied to Video-LLaMA [zhang2023videollama], which uses an additional spatial Q-former module, and VideoChat2 [li2024mvbenchcomprehensivemultimodalvideo]. We split the video into chunks of frames and apply these models to each chunk. The video Q-former module computes a weighted average of the STM, which is the attention over an individual chunk, and a continuous LTM that takes previous chunks into account. The outputs of the video Q-former are projected and then averaged. The LLM takes as input the visual tokens generated by our modified video Q-former, alongside the corresponding question, to produce the answer. (Right) Examples of $\infty$-Video LLaMA answers, equipped with our LTM, with uniform sampling and sticky memories for short and ultra-long videos. Italicized text marks a correct answer, while underlined text marks a wrong answer or hallucination.
  • Figure 2: Proposed Memory Consolidation Mechanism.
  • Figure 3: (Top) LTM attention density on the $[0, \tau]$ interval for the Interstellar trailer, using sticky memories in the final chunk of the $\infty$-Video LLaMA video Q-former's last layer. (Bottom) The same attention density map, extended over the full $t$ interval.
  • Figure 4: Highest continuous attention density frames selected using sticky memories in the Interstellar trailer for $\infty$-Video LLaMA across 3 chunks. (Left) Interval: $[0, \tau^2]$. (Middle) Interval: $(\tau^2, \tau]$. (Right) Interval: $(\tau, 1]$.
  • Figure 5: Ablation studies on the MovieChat dataset: Evaluation of accuracy and score metrics for various values of the number of basis functions $N$ and the contribution of long-term memory $\alpha$.
  • ...and 2 more figures
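
The captions above mention three ingredients: sticky memories, contraction of past memory onto $[0, \tau]$, and a weight $\alpha$ for the LTM contribution. The sketch below shows one plausible reading of each on toy data. It is an illustration under stated assumptions, not the paper's code: the helper names `sticky_positions`, `consolidate`, and `blend`, the inverse-CDF sampling at evenly spaced quantiles, the even re-spacing of sampled values on $[0, \tau]$, and the convex combination $\alpha \cdot \text{LTM} + (1 - \alpha) \cdot \text{STM}$ are all assumed for the example.

```python
import math

def sticky_positions(density, grid, num_samples):
    """Pick time positions at evenly spaced quantiles of the attention density,
    so that high-density regions receive more samples (assumed sampling rule)."""
    dt = grid[1] - grid[0]
    cdf, total = [], 0.0
    for p in density:
        total += p * dt
        cdf.append(total)
    positions, j = [], 0
    for k in range(num_samples):
        target = (k + 0.5) / num_samples * total
        while cdf[j] < target:                   # advance to the first grid cell reaching the quantile
            j += 1
        positions.append(grid[j])
    return positions

def consolidate(past_values, new_values, tau):
    """Lay past samples evenly on [0, tau] and new frames evenly on (tau, 1]."""
    m, n_new = len(past_values), len(new_values)
    past_t = [tau * (i + 0.5) / m for i in range(m)]
    new_t = [tau + (1 - tau) * (i + 0.5) / n_new for i in range(n_new)]
    return list(zip(past_t + new_t, past_values + new_values))

def blend(stm, ltm, alpha):
    """Convex combination of STM and LTM outputs (assumed form of the weighted average)."""
    return [alpha * l + (1 - alpha) * s for s, l in zip(stm, ltm)]

# Toy attention density peaked near t = 0.7 on a midpoint grid over [0, 1].
n = 1000
grid = [(i + 0.5) / n for i in range(n)]
unnorm = [math.exp(-((t - 0.7) ** 2) / (2 * 0.05 ** 2)) for t in grid]
z = sum(unnorm) / n
density = [u / z for u in unnorm]

positions = sticky_positions(density, grid, num_samples=8)
past_values = [math.sin(2 * math.pi * t) for t in positions]   # toy signal read at sticky positions
new_values = [0.0, 0.1, 0.2, 0.3]                              # toy features for the new chunk
memory = consolidate(past_values, new_values, tau=0.75)
mixed = blend(stm=[1.0, 0.0], ltm=[0.0, 1.0], alpha=0.25)
```

In this toy run all eight sticky positions cluster around $t = 0.7$, so after consolidation that narrow region occupies the whole of $[0, \tau]$, which is the sense in which salient segments receive higher granularity. Replacing the quantile rule with evenly spaced positions would correspond to the uniform-sampling baseline mentioned in Figure 1.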