Q-Frame: Query-aware Frame Selection and Multi-Resolution Adaptation for Video-LLMs
Shaojie Zhang, Jiahui Yang, Jianqin Yin, Zhenbo Luo, Jian Luan
TL;DR
Q-Frame tackles long-form video understanding by making frame selection query-aware and adapting frame resolution to the content and the query, all without training. It scores frames against the query with CLIP-based cross-modal retrieval, samples informative frames with the Gumbel-Max trick, and assigns high, medium, or low resolution to each selected frame to balance detail against compute under a fixed token budget. Experiments on MLVU, LongVideoBench, and Video-MME show consistent improvements over uniform sampling across multiple Video-LLMs, especially for longer videos and query-dependent tasks. The method is plug-and-play and model-agnostic, suggesting broad applicability to Video-LLMs across diverse video understanding tasks.
Abstract
Multimodal Large Language Models (MLLMs) have demonstrated significant success in visual understanding tasks. However, challenges persist in adapting these models for video comprehension due to the large volume of data and temporal complexity. Existing Video-LLMs that rely on uniform frame sampling often struggle to capture the spatiotemporal clues that are crucial to a given query. In this paper, we introduce Q-Frame, a novel approach for adaptive frame selection and multi-resolution scaling tailored to the video's content and the specific query. Q-Frame employs a training-free, plug-and-play strategy guided by a text-image matching network such as CLIP, utilizing the Gumbel-Max trick for efficient frame selection. Q-Frame allows Video-LLMs to process more frames without exceeding computational limits, thereby preserving critical temporal and spatial information. We demonstrate Q-Frame's effectiveness through extensive experiments on benchmark datasets, including MLVU, LongVideoBench, and Video-MME, illustrating its superiority over existing methods and its applicability across various video understanding tasks.
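
The selection step described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the query-frame similarity scores are assumed to be precomputed (for example by CLIP) and are represented here by a plain array, and the function names, the temperature parameter, the tier sizes, and the per-tier token costs are all hypothetical values chosen for illustration. The sketch perturbs the scores with Gumbel noise and keeps the top k (the Gumbel-Max trick extended to sampling without replacement), then maps the ranked frames to high, medium, or low resolution and totals the resulting token cost.

```python
import numpy as np

# Hypothetical visual-token cost per frame at each resolution tier.
TOKENS_PER_FRAME = {"high": 196, "medium": 49, "low": 16}


def gumbel_top_k(scores, k, temperature=1.0, rng=None):
    """Sample k distinct frame indices, ranked by Gumbel-perturbed score.

    Adding i.i.d. Gumbel(0, 1) noise to the logits and taking the k largest
    values samples k items without replacement, with higher-scoring frames
    more likely to be picked and to be ranked first.
    """
    rng = np.random.default_rng() if rng is None else rng
    logits = np.asarray(scores, dtype=np.float64) / temperature
    perturbed = logits + rng.gumbel(size=logits.shape)
    return np.argsort(-perturbed)[:k]


def assign_resolutions(ranked_indices, n_high, n_mid):
    """Give the top-ranked frames high resolution, the next ones medium,
    and the remainder low. Returns {frame_index: tier} in temporal order."""
    tiers = {}
    for rank, idx in enumerate(ranked_indices):
        if rank < n_high:
            tier = "high"
        elif rank < n_high + n_mid:
            tier = "medium"
        else:
            tier = "low"
        tiers[int(idx)] = tier
    return dict(sorted(tiers.items()))


def total_tokens(tiers):
    """Total visual tokens consumed by the selected frames."""
    return sum(TOKENS_PER_FRAME[tier] for tier in tiers.values())


if __name__ == "__main__":
    # Stand-in for query-frame similarities over a 6-frame video.
    scores = np.array([0.1, 0.9, 0.3, 0.8, 0.2, 0.7])
    picked = gumbel_top_k(scores, k=4, rng=np.random.default_rng(0))
    tiers = assign_resolutions(picked, n_high=1, n_mid=1)
    print(tiers, total_tokens(tiers))
```

With these assumed costs, one high-resolution frame uses as many tokens as four medium or roughly twelve low-resolution frames, which is how a tiered assignment can cover more frames than uniform high-resolution sampling under the same budget. Lowering `temperature` makes the selection approach a deterministic top-k over the scores, while raising it spreads the selection more evenly across the video.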
