$\infty$-Video: A Training-Free Approach to Long Video Understanding via Continuous-Time Memory Consolidation
Saul Santos, António Farinhas, Daniel C. McNamee, André F. T. Martins
TL;DR
The paper addresses long-video understanding, where limited context length is the bottleneck, by extending short-context video LMs with a continuous-time long-term memory (LTM) built on continuous attention. It introduces $\infty$-Video, which fuses a discrete short-term memory (STM) with a global, unbounded LTM through an attention mechanism based on a Gibbs density over time, allowing training-free processing of arbitrarily long videos in a single pass. Key innovations are sticky memories, which adaptively allocate memory density to salient regions, and a memory consolidation strategy that contracts past information while incorporating new frames. Empirically, the approach improves video QA performance for both Video-LLaMA and VideoChat2 while remaining efficient, demonstrating scalable long-context video understanding without additional training.
Abstract
Current video-language models struggle with long-video understanding due to limited context lengths and reliance on sparse frame subsampling, often leading to information loss. This paper introduces $\infty$-Video, which can process arbitrarily long videos through a continuous-time long-term memory (LTM) consolidation mechanism. Our framework augments video Q-formers by allowing them to process unbounded video contexts efficiently and without requiring additional training. Through continuous attention, our approach dynamically allocates higher granularity to the most relevant video segments, forming "sticky" memories that evolve over time. Experiments with Video-LLaMA and VideoChat2 demonstrate improved performance in video question-answering tasks, showcasing the potential of continuous-time LTM mechanisms to enable scalable and training-free comprehension of long videos.
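To make the ideas of continuous attention, sticky memories, and consolidation more concrete, the following is a minimal NumPy sketch and not the authors' implementation. It assumes frame features are regressed onto Gaussian radial basis functions over normalized time $[0, 1]$, that the Gibbs density is approximated on a discrete grid as a softmax of query-signal scores, and that consolidation re-samples the old signal and refits it together with new frames. All function names, the choice of basis, and the hyperparameters (`width`, `ridge`, `n_old`, `grid_size`) are hypothetical and chosen only for illustration.

```python
import numpy as np

def rbf_basis(t, centers, width):
    """Evaluate N Gaussian RBFs at times t; returns an array of shape (len(t), N)."""
    t = np.asarray(t, dtype=float)[:, None]
    return np.exp(-0.5 * ((t - centers[None, :]) / width) ** 2)

def fit_continuous_signal(frames, centers, width, ridge=1e-3):
    """Ridge-regress L frame features (L, d) onto the basis; returns B of shape (N, d)."""
    t = np.linspace(0.0, 1.0, frames.shape[0])   # frames placed uniformly on [0, 1]
    F = rbf_basis(t, centers, width)              # (L, N)
    A = F.T @ F + ridge * np.eye(len(centers))    # regularized Gram matrix (N, N)
    return np.linalg.solve(A, F.T @ frames)       # (N, d)

def gibbs_attention(query, B, centers, width, grid_size=512):
    """Approximate p(t) proportional to exp(score(t)) on a grid and return E_p[x(t)]."""
    grid = np.linspace(0.0, 1.0, grid_size)
    X = rbf_basis(grid, centers, width) @ B       # reconstructed signal on the grid (G, d)
    scores = X @ query / np.sqrt(len(query))
    scores = scores - scores.max()                # subtract max for numerical stability
    w = np.exp(scores)
    w /= w.sum()                                  # discretized density, sums to 1
    return grid, w, w @ X                         # context vector has shape (d,)

def consolidate(B_old, new_frames, centers, width, n_old=32, density=None):
    """Re-sample the old memory, append new frames, and refit (assumed procedure)."""
    if density is None:
        pos = np.linspace(0.0, 1.0, n_old)        # uniform re-sampling
    else:
        grid = np.linspace(0.0, 1.0, len(density))
        cdf = np.cumsum(density) / np.sum(density)
        quantiles = (np.arange(n_old) + 0.5) / n_old
        pos = np.interp(quantiles, cdf, grid)     # "sticky": more samples where density is high
    old_samples = rbf_basis(pos, centers, width) @ B_old   # (n_old, d)
    merged = np.concatenate([old_samples, new_frames], axis=0)
    # Refitting on [0, 1] contracts the old content into the first part of the interval.
    return fit_continuous_signal(merged, centers, width)

# Usage with random stand-in features (not real video embeddings).
rng = np.random.default_rng(0)
centers = np.linspace(0.0, 1.0, 16)
B = fit_continuous_signal(rng.normal(size=(64, 8)), centers, width=0.05)
grid, density, context = gibbs_attention(rng.normal(size=8), B, centers, width=0.05)
B_next = consolidate(B, rng.normal(size=(32, 8)), centers, width=0.05, density=density)
```

In this sketch the memory size is fixed by the number of basis functions (16 here) regardless of how many frames have been seen, which is the property that lets such a memory cover an unbounded context; the density-weighted re-sampling stands in for the sticky-memory idea of keeping finer detail where attention was high.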
