REEF: Relevance-Aware and Efficient LLM Adapter for Video Understanding

Sakib Reza, Xiyun Song, Heather Yu, Zongfang Lin, Mohsen Moghaddam, Octavia Camps

TL;DR

REEF addresses the inefficiency of memory-based video-language models for untrimmed video understanding by introducing two strategies: relevance-aware temporal compression (RTC) and spatial token filtering (STF). The framework employs a frozen visual encoder and a Q-Former with Visual and Query Memory Banks, guided by a lightweight Relevance Scorer that selectively compresses memory and filters tokens, with a differentiable Top-K operator enabling end-to-end training. Across untrimmed video classification, video question answering, and video captioning, REEF achieves competitive or state-of-the-art accuracy on four datasets while reducing GFLOPs by up to 34%, improving efficiency without sacrificing performance. The approach is aimed at scalable vision-language systems and more efficient real-time video understanding with large language models.

Abstract

Integrating vision models into large language models (LLMs) has sparked significant interest in creating vision-language foundation models, especially for video understanding. Recent methods often utilize memory banks to handle untrimmed videos for video-level understanding. However, they typically compress visual memory using similarity-based greedy approaches, which can overlook the contextual importance of individual tokens. To address this, we introduce an efficient LLM adapter designed for video-level understanding of untrimmed videos that prioritizes the contextual relevance of spatio-temporal tokens. Our framework leverages scorer networks to selectively compress the visual memory bank and filter spatial tokens based on relevance, using a differentiable Top-K operator for end-to-end training. Across three key video-level understanding tasks (untrimmed video classification, video question answering, and video captioning), our method achieves competitive or superior results on four large-scale datasets while reducing computational overhead by up to 34%. The code will be available soon on GitHub.
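
As a rough illustration of relevance-based spatial token filtering, the sketch below keeps the k highest-scoring tokens of one frame and preserves their original order. It is not the authors' implementation: the function name `filter_spatial_tokens`, the string stand-ins for feature tokens, and the hand-written scores are all hypothetical, and the hard selection used here is not differentiable, whereas the abstract states that training relies on a differentiable Top-K operator.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def filter_spatial_tokens(tokens: Sequence[T], scores: Sequence[float], keep: int) -> List[T]:
    """Return the `keep` tokens with the highest relevance scores.

    Hypothetical helper for illustration only. The surviving tokens are
    returned in their original order, and ties go to the earlier token.
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    if not 0 < keep <= len(tokens):
        raise ValueError("keep must be between 1 and len(tokens)")

    # Rank token indices by descending score (earlier index wins a tie).
    ranked = sorted(range(len(tokens)), key=lambda i: (-scores[i], i))
    # Restore the original ordering among the selected indices.
    kept = sorted(ranked[:keep])
    return [tokens[i] for i in kept]


# Toy example: strings stand in for feature tokens, scores are made up.
tokens = ["sky", "hand", "ball", "grass"]
scores = [0.1, 0.8, 0.9, 0.2]
selected = filter_spatial_tokens(tokens, scores, keep=2)  # ["hand", "ball"]
```

In a trainable model the scores would come from a learned scorer network rather than being supplied by hand, and the selection step would be replaced by a relaxation that lets gradients reach the scorer.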

Paper Structure

This paper contains 27 sections, 7 equations, 7 figures, 8 tables.

Figures (7)

  • Figure 1: Illustrative example of relevance-aware memory bank compression. In this example, the goal is to compress the memory bank by merging a pair of frames. Existing similarity-based memory compression methods [he2024ma, song2024moviechat] would merge pair (a) because of its slightly higher similarity, which risks losing critical action information while retaining the irrelevant context in (b). In contrast, our method makes relevance-aware merging decisions, preserving the distinctiveness and contextual significance of the memory bank. Note that, in practice, memory banks consist of feature tokens that are not human-interpretable, not frames.
  • Figure 2: The proposed LLM-based REEF framework for video understanding. Encoded frame visual features are stored in the visual memory bank over time, with the STF module removing redundant and irrelevant spatial information. These filtered features are then sent to the Q-Former for temporal modeling and alignment with the text domain. The RTC module collects token relevance scores from both the TTS and STF modules, compressing the memory bank by preserving only the most relevant and distinctive information based on adjacent similarity and relevance, ensuring the memory size remains fixed.
  • Figure 3: Steps of the proposed Relevance-aware Temporal Compression (RTC). This approach preserves the most discriminative and relevant features while maintaining the temporal order.
  • Figure 4: Conceptual overview of our spatial token filtering method. This approach preserves the most discriminative and relevant regions based on predefined anchors. In practice, the filtering is applied to spatial tokens within the temporal memory bank, rather than directly on the video frames.
  • Figure 5: Qualitative evaluation of our REEF model across three video understanding tasks, compared to the baseline MA-LMM.
  • ...and 2 more figures
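
The contrast drawn in the Figure 1 caption can be sketched in a few lines. The code below merges one adjacent pair of memory entries per call, so the bank shrinks by one and the temporal order is kept. The merge priority (cosine similarity minus a weighted mean relevance), the relevance-weighted average used for merging, and the names `compress_once` and `weight` are assumptions made for this illustration; the captions do not give the actual RTC formula. Setting `weight=0` falls back to a purely similarity-based greedy merge.

```python
import math
from typing import List, Tuple

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def compress_once(
    bank: List[Vector], relevance: List[float], weight: float = 1.0
) -> Tuple[List[Vector], List[float]]:
    """Merge one adjacent pair; returns a bank that is one entry shorter.

    Assumed priority (illustrative, not from the paper):
        similarity(i, i+1) - weight * mean(relevance[i], relevance[i+1])
    so similar pairs with low relevance are merged first.
    """
    if len(bank) != len(relevance) or len(bank) < 2:
        raise ValueError("need at least two entries and matching relevance scores")

    best, best_score = 0, -math.inf
    for i in range(len(bank) - 1):
        mean_rel = (relevance[i] + relevance[i + 1]) / 2
        score = cosine(bank[i], bank[i + 1]) - weight * mean_rel
        if score > best_score:
            best, best_score = i, score

    # Merge the chosen pair with a relevance-weighted average (assumption).
    r_i, r_j = relevance[best], relevance[best + 1]
    total = r_i + r_j
    w_i = r_i / total if total > 0 else 0.5
    merged = [w_i * x + (1 - w_i) * y for x, y in zip(bank[best], bank[best + 1])]

    new_bank = bank[:best] + [merged] + bank[best + 2:]
    new_rel = relevance[:best] + [max(r_i, r_j)] + relevance[best + 2:]
    return new_bank, new_rel


# Toy bank: entries 0-1 are the relevant "action" pair, entries 2-3 are background.
bank = [[1.0, 0.0], [1.0, 0.05], [0.0, 1.0], [0.1, 1.0]]
relevance = [0.9, 0.9, 0.1, 0.1]

similarity_only, _ = compress_once(bank, relevance, weight=0.0)   # merges entries 0 and 1
relevance_aware, _ = compress_once(bank, relevance, weight=1.0)   # merges entries 2 and 3
```

In this toy setup the first pair is marginally more similar, so the similarity-only variant collapses the two high-relevance entries, while the relevance-aware variant merges the low-relevance background pair instead and leaves the high-relevance entries untouched.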