Q-Frame: Query-aware Frame Selection and Multi-Resolution Adaptation for Video-LLMs

Shaojie Zhang, Jiahui Yang, Jianqin Yin, Zhenbo Luo, Jian Luan

TL;DR

Q-Frame tackles long-form video understanding with a training-free method that makes frame selection query-aware and adapts frame resolution to the video content and the query. It scores frames with CLIP-based cross-modal retrieval, uses the Gumbel-Max trick to sample informative frames, and assigns high, medium, or low resolution to each selected frame to balance detail against compute under a token budget. Experiments on MLVU, LongVideoBench, and Video-MME show consistent improvements over uniform sampling across multiple Video-LLMs, especially for longer videos and query-dependent tasks. The method is plug-and-play and model-agnostic, suggesting broad applicability for Video-LLMs in diverse video understanding tasks.

Abstract

Multimodal Large Language Models (MLLMs) have demonstrated significant success in visual understanding tasks. However, challenges persist in adapting these models for video comprehension due to the large volume of data and temporal complexity. Existing Video-LLMs using uniform frame sampling often struggle to capture the query-related crucial spatiotemporal clues of videos effectively. In this paper, we introduce Q-Frame, a novel approach for adaptive frame selection and multi-resolution scaling tailored to the video's content and the specific query. Q-Frame employs a training-free, plug-and-play strategy generated by a text-image matching network like CLIP, utilizing the Gumbel-Max trick for efficient frame selection. Q-Frame allows Video-LLMs to process more frames without exceeding computational limits, thereby preserving critical temporal and spatial information. We demonstrate Q-Frame's effectiveness through extensive experiments on benchmark datasets, including MLVU, LongVideoBench, and Video-MME, illustrating its superiority over existing methods and its applicability across various video understanding tasks.
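The query-aware selection step mentioned in the abstract can be illustrated with a minimal sketch. It assumes each frame already has a query-relevance score (for example a CLIP text-image similarity) and draws `k` distinct frames by perturbing the scores with Gumbel noise and keeping the largest values. The function name `gumbel_top_k`, the `temperature` parameter, and the example scores are hypothetical and not taken from the paper, whose exact formulation may differ.

```python
import math
import random


def gumbel_top_k(scores, k, temperature=1.0, seed=None):
    """Sample k distinct frame indices, favouring frames with higher scores.

    Illustrative only: `scores` stands in for query-frame similarities,
    and `temperature` is an assumed knob, not a parameter from the paper.
    """
    if not 0 < k <= len(scores):
        raise ValueError("k must be between 1 and len(scores)")
    rng = random.Random(seed)
    perturbed = []
    for idx, score in enumerate(scores):
        # Gumbel(0, 1) noise is -log(-log(U)) with U ~ Uniform(0, 1).
        # Clamp U away from 0 and 1 so both logarithms stay finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))
        perturbed.append((score / temperature + gumbel, idx))
    # Keep the k largest perturbed scores, then restore temporal order.
    top = sorted(perturbed, reverse=True)[:k]
    return sorted(idx for _, idx in top)


if __name__ == "__main__":
    similarities = [0.1, 0.9, 0.3, 0.8, 0.2]  # made-up relevance scores
    print(gumbel_top_k(similarities, k=3, seed=0))
```

With a very small `temperature` the noise becomes negligible and the sketch reduces to plain top-k selection; larger values spread the selection over more of the video.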

Paper Structure

This paper contains 24 sections, 8 equations, 9 figures, 14 tables.

Figures (9)

  • Figure 1: Comparison of uniform frame sampling and proposed Q-Frame sampling. Uniform Frame Sampling selects frames at fixed intervals, leading to sparse and potentially irrelevant frame selections that disrupt temporal continuity. In contrast, Q-Frame dynamically selects query-aware frames and adapts their resolution, ensuring that the most relevant frames are selected with optimized resolution to preserve crucial visual details. This adaptive approach enhances the Video-LLMs' ability to understand long-form videos more efficiently and effectively, addressing the limitations of traditional frame sampling methods.
  • Figure 2: Overall accuracy (%) on Video-MME (without subtitles) for Qwen2-VL, comparing uniform sampling and Q-Frame selection from 128 frames. Note that Qwen2-VL-Video is the video understanding model, and Qwen2-VL is a multi-image understanding model based on different activation weights.
  • Figure 3: The overall framework of Q-Frame. Q-Frame is composed of Cross-modal Query Retrieval (CQR), Query-Aware Frame Selection (QFS), and Multi-Resolution Adaptation (MRA). CQR retrieves the frames that are most semantically relevant to the textual query, ensuring that only meaningful visual information is considered. QFS adaptively selects frames based on their relevance to the query, improving efficiency by concentrating on the most important temporal segments. MRA optimizes computational resources by assigning different resolutions to frames, preserving fine details in important frames while reducing cost for less critical ones. Note that preprocessing strategies differ across Video-LLMs, so the MRA step shown in the dotted line is not applicable to every model.
  • Figure 4: Accuracies (%) of Qwen2-VL-Video, Qwen2-VL, and Q-Frame on six tasks in Video-MME. The maximum results for each task are highlighted.
  • Figure 5: Case analysis from Video-MME [fu2024video]. Uniform sampling captures only a limited number of frames relevant to the query, whereas Q-Frame extracts more relevant frames at a variety of resolutions.
  • ...and 4 more figures
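
The Multi-Resolution Adaptation idea from the Figure 3 caption, where more relevant frames keep more detail, can be sketched as a simple tiering scheme. This is a sketch under stated assumptions, not the authors' implementation: the function `assign_resolutions`, the tier sizes, and the per-frame token costs below are all hypothetical placeholders.

```python
def assign_resolutions(ranked_frames, tiers):
    """Map frames to resolution tiers and total up the visual-token cost.

    ranked_frames: frame indices ordered from most to least query-relevant.
    tiers: list of (name, frame_count, tokens_per_frame), highest tier first.
    All tier values are illustrative assumptions, not figures from the paper.
    """
    plan = {}
    total_tokens = 0
    position = 0
    for name, count, tokens_per_frame in tiers:
        # The next `count` most relevant frames go into this tier.
        for frame in ranked_frames[position:position + count]:
            plan[frame] = name
            total_tokens += tokens_per_frame
        position += count
    return plan, total_tokens


if __name__ == "__main__":
    ranked = [7, 2, 9, 4, 0, 5]  # made-up ranking, most relevant first
    tiers = [("high", 1, 256), ("medium", 2, 64), ("low", 3, 16)]
    plan, cost = assign_resolutions(ranked, tiers)
    print(plan, cost)
```

Comparing the returned token total against a model's budget shows the trade-off: lowering the resolution of less relevant frames frees tokens that can be spent on covering more frames.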