AdaRD-Key: Adaptive Relevance-Diversity Keyframe Sampling for Long-form Video Understanding

Xian Zhang, Zexi Wu, Zinuo Li, Hongming Xu, Luqi Gong, Farid Boussaid, Naoufel Werghi, Mohammed Bennamoun

TL;DR

AdaRD-Key is a training-free keyframe sampling module for query-driven long-form video understanding. It maximizes a unified Relevance–Diversity Max-Volume (RD-MV) objective that combines a query-conditioned relevance score with a log-determinant diversity component, yielding informative yet non-redundant frames.

Abstract

Understanding long-form videos remains a significant challenge for vision–language models (VLMs) due to their extensive temporal length and high information density. Most current multimodal large language models (MLLMs) rely on uniform sampling, which often overlooks critical moments, leading to incorrect responses to queries. In parallel, many keyframe selection approaches impose rigid temporal spacing: once a frame is chosen, an exclusion window suppresses adjacent timestamps to reduce redundancy. While effective at limiting overlap, this strategy frequently misses short, fine-grained cues near important events. Other methods instead emphasize visual diversity but neglect query relevance. We propose AdaRD-Key, a training-free keyframe sampling module for query-driven long-form video understanding. AdaRD-Key maximizes a unified Relevance–Diversity Max-Volume (RD-MV) objective, combining a query-conditioned relevance score with a log-determinant diversity component to yield informative yet non-redundant frames. To handle broad queries with weak alignment to the video, AdaRD-Key employs a lightweight relevance-aware gating mechanism; when the relevance distribution indicates weak alignment, the method seamlessly shifts into a diversity-only mode, enhancing coverage without additional supervision. Our pipeline is training-free, computationally efficient (running in real time on a single GPU), and compatible with existing VLMs in a plug-and-play manner. Extensive experiments on LongVideoBench and Video-MME demonstrate state-of-the-art performance, particularly on long-form videos. Code is available at https://github.com/Xian867/AdaRD-Key.
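
To make the selection idea concrete, the following is a minimal sketch of a greedy relevance-plus-log-determinant selector with a relevance gate. It is not the authors' implementation. The function name `select_keyframes`, the regularizer `eps`, the gate rule (maximum relevance minus mean relevance compared against `gate_thresh`), and the exact form of the objective (sum of relevance scores plus `lam` times the log-determinant of the selected frames' Gram matrix) are all assumptions made for illustration; the abstract does not specify them.

```python
import numpy as np

def select_keyframes(feats, relevance, k, lam=1.0, gate_thresh=0.1, eps=1e-3):
    """Greedy sketch of relevance + log-det diversity selection.

    feats:      (n, d) array of frame embeddings (hypothetical input).
    relevance:  (n,) array of query-frame relevance scores.
    Returns the sorted selected indices and whether relevance was used.
    """
    feats = np.asarray(feats, dtype=float)
    relevance = np.asarray(relevance, dtype=float)
    # Cosine-similarity Gram matrix of the frame embeddings.
    feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    gram = feats @ feats.T

    # ASSUMED gate: a flat relevance curve means weak query alignment,
    # so fall back to a diversity-only objective.
    use_relevance = (relevance.max() - relevance.mean()) >= gate_thresh

    selected = []
    for _ in range(min(k, len(feats))):
        best_idx, best_score = -1, -np.inf
        for i in range(len(feats)):
            if i in selected:
                continue
            idx = selected + [i]
            # Regularized sub-Gram matrix keeps the log-determinant finite.
            sub = gram[np.ix_(idx, idx)] + eps * np.eye(len(idx))
            _, logdet = np.linalg.slogdet(sub)
            score = lam * logdet
            if use_relevance:
                score += relevance[idx].sum()
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
    return sorted(selected), use_relevance
```

In this sketch a near-duplicate of an already selected frame makes the sub-Gram matrix almost singular, so its log-determinant drops sharply and the candidate is penalized even when its relevance is high. Recomputing the log-determinant for every candidate is done here for readability only; an efficient version would update it incrementally.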

Paper Structure

This paper contains 18 sections, 24 equations, 7 figures, 4 tables, 2 algorithms.

Figures (7)

  • Figure 1: Video-MME results. Left: Category-wise accuracy: AdaRD-Key (ours) forms the outer envelope vs. Uniform and AKS (larger radius is better). Right: Two long-video examples (Temporal Reasoning, 42 min; Information Synopsis, 32 min) where our selected keyframes capture the needed evidence and yield correct answers.
  • Figure 2: Overview of the proposed framework. Video frames are sampled and scored against the query using BLIP-2. A lightweight relevance-aware gate determines whether to adopt a relevance–diversity or diversity-only strategy. The RD-MV selector then greedily selects keyframes to maximize $R(f)+\lambda D(F)$, which are fed into the VLM for downstream tasks.
  • Figure 3: Qualitative results on LongVideoBench using 64 keyframes. Blue text highlights scene constraints; red text marks critical evidence. AdaRD-Key (Ours) captures key details missed by Uniform and AKS, leading to correct answers with LLaVA-Video.
  • Figure 4: Performance of Qwen2-VL with a $K=32$ frame budget under three sampling strategies: Uniform, AKS, and AdaRD-Key (ours), for the query shown at the top. For each method, the left speech bubble shows the model's predicted answer option and its correctness (✓/✗) using the frames selected by that sampler. The middle plot shows the relevance curve with selected points (red); orange shaded spans mark ground-truth keyframe regions, and the bracketed number under each row reports how many selected frames fall inside these regions. Thumbnails on the right display the sampled frames; red borders highlight key evidence.
  • Figure 5: Performance of Qwen2-VL on Video-MME with a $K=32$ frame budget under three sampling strategies—Uniform, AKS, and AdaRD-Key (ours)—for the query shown at the top. For each method, the left speech bubble reports the model’s predicted answer and its correctness (correct/incorrect) using the frames returned by that sampler. The middle plot shows the relevance curve with selected points (red). Orange shaded spans denote the union of multiple disjoint ground-truth keyframe regions distributed across the video; the bracketed number under each row counts how many selected frames fall inside any of these spans. Thumbnails on the right depict the sampled frames; red borders highlight key evidence.
  • ...and 2 more figures