AdaCM$^2$: On Understanding Extremely Long-Term Video with Adaptive Cross-Modality Memory Reduction

Yuanbin Man, Ying Huang, Chengming Zhang, Bingzhe Li, Wei Niu, Miao Yin

TL;DR

This work tackles memory bottlenecks in extremely long-term video understanding by introducing AdaCM$^2$, an adaptive cross-modality memory reduction framework that integrates cross-attention between visual tokens and text prompts into the visual encoder via a Q-Former. By regressing learnable queries frame-by-frame and maintaining a layer-aware video cache, AdaCM$^2$ preserves only tokens highly correlated with the text, with memory growth converging to a finite bound $|\mathbf{K}_T| \to \frac{P r}{1 - r}$ as $T \to \infty$, where $r = \alpha + (1 - \alpha)\beta$. The approach yields state-of-the-art results on long-term video tasks (e.g., LVU) while dramatically reducing GPU memory usage (up to 65%), and demonstrates strong capabilities on VQA and captioning datasets. The method is plug-and-play for BLIP-based models and offers practical scalability for processing extremely long videos with limited resources.
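The bound above can be checked numerically. The sketch below is not the authors' code: it assumes that $P$ is the number of visual tokens added per frame, that $\alpha$ is the fraction of the cache treated as recent (kept whole), and that $\beta$ is the fraction of the remaining, older part that is kept. Under those assumptions one recurrence that yields the stated limit is $|\mathbf{K}_t| = r\,(|\mathbf{K}_{t-1}| + P)$, whose fixed point is $\frac{P r}{1 - r}$. The function names are hypothetical.

```python
def cache_size_limit(P, alpha, beta):
    """Closed-form limit P*r/(1-r) with r = alpha + (1 - alpha) * beta."""
    r = alpha + (1 - alpha) * beta
    if not 0 <= r < 1:
        raise ValueError("the bound is finite only when 0 <= r < 1")
    return P * r / (1 - r)


def simulate_cache_sizes(P, alpha, beta, num_frames):
    """Iterate the assumed recurrence |K_t| = r * (|K_{t-1}| + P).

    Each step appends P new tokens to the cache and then keeps a
    fraction r of the result. Sizes are floats (rounding is ignored).
    """
    r = alpha + (1 - alpha) * beta
    size = 0.0
    sizes = []
    for _ in range(num_frames):
        size = r * (size + P)
        sizes.append(size)
    return sizes


if __name__ == "__main__":
    # Illustrative values only; they are not taken from the paper.
    P, alpha, beta = 32, 0.1, 0.5
    sizes = simulate_cache_sizes(P, alpha, beta, num_frames=200)
    print("simulated size after 200 frames:", sizes[-1])
    print("closed-form limit:", cache_size_limit(P, alpha, beta))
```

Because $r < 1$, the gap to the limit shrinks geometrically, so the cache size stops growing after a modest number of frames regardless of the video length.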

Abstract

Advances in large language models (LLMs) have improved video understanding by combining LLMs with visual models. However, most existing LLM-based models (e.g., VideoLLaMA, VideoChat) are constrained to processing short-duration videos. Recent approaches handle long-term videos by extracting visual features and compressing them into a fixed-size memory. Nevertheless, those methods leverage only the visual modality to merge video tokens and overlook the correlation between visual and textual queries, which makes complex question-answering tasks difficult to handle effectively. To address the challenges of long videos and complex prompts, we propose AdaCM$^2$, which, for the first time, introduces an adaptive cross-modality memory reduction approach to video-text alignment in an auto-regressive manner on video streams. Our extensive experiments on various video understanding tasks, such as video captioning, video question answering, and video classification, demonstrate that AdaCM$^2$ achieves state-of-the-art performance across multiple datasets while significantly reducing memory usage. Notably, it achieves a 4.5% improvement across multiple tasks in the LVU dataset with a GPU memory consumption reduction of up to 65%.

Paper Structure

This paper contains 15 sections, 8 equations, 8 figures, 4 tables, 1 algorithm.

Figures (8)

  • Figure 1: (Left) Existing approaches compress visual features of videos via single-modality correlation; (Right) Our AdaCM$^2$ reduces video memory adaptively based on cross-modality attention.
  • Figure 2: Zero-shot case study of AdaCM$^2$ on the Ego4D dataset (Grauman et al., 2022). As shown, AdaCM$^2$ can 1) summarize an extremely long video lasting over 2 hours with limited memory consumption and accurately identify the number on a person’s back at the end, and 2) answer questions about a mid-length video spanning more than 20 minutes.
  • Figure 3: Visualization of cross-modality attention, generated using a randomly sampled video from the MSR-VTT dataset. (a) Cross-attention score map of the 74th frame in the final layer and last head. (b) Cross-attention score distribution of the 80th frame in the final layer and last head. (c) The layer-wise cosine similarities of attention scores between the current frame and adjacent frames.
  • Figure 4: The framework of AdaCM$^2$. With a video and a text query as input, AdaCM$^2$ first uses a visual encoder to extract visual features from the video frames. The video Q-Former then embeds the correlation between the visual features and the text prompt into a learnable query in a regressive manner. Finally, the LLM generates the answer from the length-limited query embedding. To reduce memory consumption, Adaptive Memory Reduction partitions the Video Cache into previous and recent parts. Based on the cross-modality attention scores, AdaCM$^2$ then identifies important visual features and removes unimportant visual tokens from the cache layer by layer. The snowflake denotes frozen pre-trained models, while the fire tag marks models that are fine-tuned.
  • Figure 5: Illustration for our video memory reduction. The video cache is first partitioned into recent and previous parts. Important visual tokens with high cross-modality attention scores in the previous cache are then preserved.
  • ...and 3 more figures
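
The reduction step described in the captions of Figures 4 and 5 can be sketched for a single layer as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `reduce_cache`, the parameters `alpha` (fraction of the cache treated as recent) and `beta` (fraction of the previous part that is kept), and the use of one scalar score per token are all hypothetical choices made for the example.

```python
def reduce_cache(tokens, scores, alpha, beta):
    """Single-layer sketch of cross-modality cache reduction.

    tokens: cached visual tokens in temporal order (oldest first).
    scores: one cross-modality attention score per token (assumed given).
    alpha:  fraction of the cache treated as the recent part (kept whole).
    beta:   fraction of the previous part that is kept.
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")

    n = len(tokens)
    n_recent = int(alpha * n)       # most recent tokens, always kept
    split = n - n_recent            # indices [0, split) form the previous part
    n_keep = int(beta * split)      # how many previous tokens survive

    # Rank previous tokens by attention score and keep the highest n_keep.
    ranked = sorted(range(split), key=lambda i: scores[i], reverse=True)
    kept_prev = sorted(ranked[:n_keep])  # restore temporal order

    kept = kept_prev + list(range(split, n))
    return [tokens[i] for i in kept], [scores[i] for i in kept]


if __name__ == "__main__":
    # Toy data; the scores are made up for illustration.
    tokens = list("abcdefghij")
    scores = [0.9, 0.1, 0.8, 0.2, 0.7, 0.3, 0.6, 0.4, 0.05, 0.05]
    kept_tokens, kept_scores = reduce_cache(tokens, scores, alpha=0.2, beta=0.5)
    print(kept_tokens)
```

In a layer-wise scheme such as the one the captions describe, a step like this would run separately for each layer's cache, so different layers may retain different tokens.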