REEF: Relevance-Aware and Efficient LLM Adapter for Video Understanding

Sakib Reza; Xiyun Song; Heather Yu; Zongfang Lin; Mohsen Moghaddam; Octavia Camps

TL;DR

REEF addresses the inefficiency of memory-based video-language models on untrimmed video by introducing two strategies: relevance-aware temporal compression (RTC) and spatial token filtering (STF). The framework employs a frozen visual encoder and a Q-Former with Visual and Query Memory Banks. A lightweight Relevance Scorer guides which memory entries are compressed and which tokens are filtered, and a differentiable Top-K operator keeps the selection trainable end to end. Across untrimmed video classification, video question answering, and video captioning, REEF achieves competitive or state-of-the-art accuracy on four datasets while reducing GFLOPs by up to 34%, improving efficiency without sacrificing performance. The approach is aimed at scalable vision-language systems that need more efficient video understanding with large language models.
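As a rough illustration of what relevance-based spatial token filtering amounts to at inference time, the sketch below keeps the `k` highest-scoring tokens and preserves their original order. It is a minimal sketch under stated assumptions, not the paper's implementation: the function name `filter_tokens_by_relevance` is hypothetical, the scores are assumed to come from some external scorer, and a hard Top-K is used here in place of the differentiable Top-K operator that REEF trains with.

```python
from typing import List, Sequence, Tuple


def filter_tokens_by_relevance(
    tokens: Sequence, scores: Sequence[float], k: int
) -> Tuple[List, List[int]]:
    """Keep the k tokens with the highest relevance scores (hard Top-K).

    Hypothetical helper for illustration only. The scores are assumed to be
    produced by a separate relevance scorer; this function does not compute them.
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    k = max(0, min(k, len(tokens)))
    # Rank token indices from most to least relevant.
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    # Restore the original token order among the kept indices.
    keep = sorted(ranked[:k])
    return [tokens[i] for i in keep], keep


kept_tokens, kept_indices = filter_tokens_by_relevance(
    ["t0", "t1", "t2", "t3"], [0.1, 0.9, 0.3, 0.8], k=2
)
```

Fewer tokens passed downstream means less computation per frame, which is the kind of saving the GFLOPs reduction refers to; how REEF chooses `k` is not stated in this summary.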

Abstract

Integrating vision models into large language models (LLMs) has sparked significant interest in creating vision-language foundation models, especially for video understanding. Recent methods often utilize memory banks to handle untrimmed videos for video-level understanding. However, they typically compress visual memory using similarity-based greedy approaches, which can overlook the contextual importance of individual tokens. To address this, we introduce an efficient LLM adapter designed for video-level understanding of untrimmed videos that prioritizes the contextual relevance of spatio-temporal tokens. Our framework leverages scorer networks to selectively compress the visual memory bank and filter spatial tokens based on relevance, using a differentiable Top-K operator for end-to-end training. Across three key video-level understanding tasks (untrimmed video classification, video question answering, and video captioning), our method achieves competitive or superior results on four large-scale datasets while reducing computational overhead by up to 34%. The code will be available soon on GitHub.
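The contrast drawn above, between similarity-based greedy compression and relevance-based selection, can be made concrete with a toy memory bank. The sketch below is illustrative only: `greedy_merge` shows one common form of similarity-based compression (averaging the most similar adjacent pair until the bank fits its capacity), which is an assumption about the baseline rather than a description of any specific method, and `relevance_select` is a hypothetical stand-in that keeps the highest-scoring entries. Neither function reproduces REEF's actual RTC module.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity, returning 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def greedy_merge(memory: Sequence[Sequence[float]], capacity: int) -> List[List[float]]:
    """Assumed baseline: repeatedly average the most similar adjacent pair."""
    bank = [list(f) for f in memory]
    while len(bank) > capacity and len(bank) > 1:
        sims = [cosine(bank[i], bank[i + 1]) for i in range(len(bank) - 1)]
        i = max(range(len(sims)), key=sims.__getitem__)
        merged = [(a + b) / 2 for a, b in zip(bank[i], bank[i + 1])]
        bank[i:i + 2] = [merged]  # replace the pair with its average
    return bank


def relevance_select(
    memory: Sequence[Sequence[float]], scores: Sequence[float], capacity: int
) -> List[List[float]]:
    """Hypothetical alternative: keep the highest-scoring entries, in temporal order."""
    ranked = sorted(range(len(memory)), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:max(0, capacity)])
    return [list(memory[i]) for i in keep]


frames = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
merged_bank = greedy_merge(frames, capacity=2)
selected_bank = relevance_select(frames, scores=[0.9, 0.1, 0.8], capacity=2)
```

The greedy variant decides purely from feature similarity, so an entry can be averaged away regardless of how much it matters for the task; a score-driven variant makes that decision depend on estimated relevance instead.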
