Adaptive Keyframe Sampling for Long Video Understanding

Xi Tang, Jihao Qiu, Lingxi Xie, Yunjie Tian, Jianbin Jiao, Qixiang Ye

TL;DR

Long videos produce more visual tokens than multimodal LLMs can accept, so most frames must be dropped and critical information can be lost. The authors propose Adaptive Keyframe Sampling (AKS), a plug-and-play module that selects M keyframes by jointly optimizing prompt relevance and temporal coverage, approximated with an adaptive sampling strategy (ADA) that recursively judges and splits temporal bins. On LongVideoBench and VideoMME, AKS yields consistent gains over baselines and even surpasses some larger models, underscoring the value of pre-filtering informative frames. The work advocates information pre-filtering as a core step for robust long-video understanding and demonstrates broad applicability to other video tasks.

Abstract

Multimodal large language models (MLLMs) have enabled open-world visual understanding by injecting visual input as extra tokens into large language models (LLMs) as contexts. However, when the visual input changes from a single image to a long video, the above paradigm encounters difficulty because the vast amount of video tokens has significantly exceeded the maximal capacity of MLLMs. Therefore, existing video-based MLLMs are mostly established upon sampling a small portion of tokens from input data, which can cause key information to be lost and thus produce incorrect answers. This paper presents a simple yet effective algorithm named Adaptive Keyframe Sampling (AKS). It inserts a plug-and-play module known as keyframe selection, which aims to maximize the useful information with a fixed number of video tokens. We formulate keyframe selection as an optimization involving (1) the relevance between the keyframes and the prompt, and (2) the coverage of the keyframes over the video, and present an adaptive algorithm to approximate the best solution. Experiments on two long video understanding benchmarks validate that Adaptive Keyframe Sampling improves video QA accuracy (beyond strong baselines) upon selecting informative keyframes. Our study reveals the importance of information pre-filtering in video-based MLLMs. Code is available at https://github.com/ncTimTang/AKS.
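
To make the two objectives in the abstract concrete, the following minimal sketch contrasts the two extremes that keyframe selection has to balance: sampling for coverage only (uniform sampling) and sampling for relevance only (top-M by prompt-frame score). The function names and the toy scores are hypothetical illustrations and are not taken from the paper's code.

```python
from typing import List, Sequence


def uniform_sample(num_frames: int, m: int) -> List[int]:
    """Coverage only: pick the centre frame of m equal-width segments."""
    m = min(m, num_frames)
    if m <= 0:
        return []
    return [int((i + 0.5) * num_frames / m) for i in range(m)]


def top_relevance(scores: Sequence[float], m: int) -> List[int]:
    """Relevance only: pick the m highest-scoring frames, in temporal order."""
    order = sorted(range(len(scores)), key=lambda t: scores[t], reverse=True)
    return sorted(order[:m])


# Toy prompt-frame scores (hypothetical): relevant content sits near the start.
toy_scores = [0.1, 0.9, 0.8, 0.2, 0.7, 0.1, 0.1, 0.1]
coverage_only = uniform_sample(len(toy_scores), 2)   # spread over the video
relevance_only = top_relevance(toy_scores, 2)        # clustered on the peak
```

Uniform sampling ignores the prompt entirely, while pure top-M selection can cluster all keyframes around a single moment; AKS is described as a compromise between the two.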

Paper Structure

This paper contains 15 sections, 2 equations, 8 figures, 5 tables, 1 algorithm.

Figures (8)

  • Figure 1: The accuracy of video-based MLLMs heavily relies on the quality of keyframes. The above example shows a long video from VideoMME [fu2024video] where keyframes are marked with green stars. The same MLLM (i.e., LLaVA-Video [zhang2024video]) is used for answering the question. Uniform sampling (the default setting in [zhang2024video]) finds irrelevant frames (the MLLM mostly performs a random guess), while our algorithm (AKS) finds keyframes and produces the correct answer.
  • Figure 2: The overall framework of our approach. We insert a plug-and-play module, Adaptive Keyframe Sampling (AKS, marked in green frames) into the MLLM to improve the quality of sampled keyframes. Each red dot indicates a prompt-frame matching score (i.e., $s(\mathbf{Q},\mathbf{F}_t)$, see Section \ref{method:principles}). AKS follows a recursive, judge-and-split optimization for keyframe selection (see Section \ref{method:optimization}).
  • Figure 3: An example of adaptive sampling (ADA). $8$ keyframes are to be selected from the input video. Each red dot indicates a prompt-frame matching score, $s(\mathbf{Q},\mathbf{F}_t)$. At Level-$0$ and Level-$1$, all bins are split into two sub-bins; at Level-$2$, only the rightmost bin is further partitioned while the top-$2$ scores are sampled from the other three bins. Level-$3$ has reached the maximal depth.
  • Figure 4: AKS improves the baseline MLLMs for video understanding. The left three examples come from LongVideoBench while the right three come from VideoMME. Green stars indicate keyframes selected by AKS (note that $64$ keyframes are selected for each video).
  • Figure 5: Two examples of how different sampling strategies impact video understanding. The left case comes from LongVideoBench (focusing on one moment) and the right one comes from VideoMME (relying on multiple moments). Each curve shows the $s(\mathbf{Q},\mathbf{F}_t)$ score over time, and the yellow circles indicate the position of sampled keyframes. We also annotate the number of true keyframes and the reason for each failure case below the answer.
  • ...and 3 more figures
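
The judge-and-split procedure described in the captions of Figures 2 and 3 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the captions do not state the judging rule, so the sketch assumes a bin is sampled directly (top-k scores) when the mean of its top-k scores exceeds the mean of all its scores by a threshold, and is otherwise split into two halves that each receive half of the keyframe budget. The names `ada_select`, `max_depth`, and `threshold` are hypothetical.

```python
from typing import List, Sequence


def ada_select(scores: Sequence[float], m: int, max_depth: int = 3,
               threshold: float = 0.5) -> List[int]:
    """Return sorted indices of min(m, len(scores)) keyframes.

    scores[t] plays the role of the prompt-frame score s(Q, F_t).
    """

    def recurse(lo: int, hi: int, k: int, depth: int) -> List[int]:
        n = hi - lo  # the bin covers frames lo .. hi-1
        if k <= 0 or n <= 0:
            return []
        k = min(k, n)
        order = sorted(range(lo, hi), key=lambda t: scores[t], reverse=True)
        top = order[:k]
        mean_all = sum(scores[lo:hi]) / n
        mean_top = sum(scores[t] for t in top) / k
        # Assumed judging rule: a clear peak means the top-k frames suffice.
        peaked = (mean_top - mean_all) >= threshold
        if peaked or depth >= max_depth or k == 1:
            return top
        # Otherwise split the bin in two and divide the budget between halves.
        mid = (lo + hi) // 2
        left_k = k // 2
        return (recurse(lo, mid, left_k, depth + 1)
                + recurse(mid, hi, k - left_k, depth + 1))

    return sorted(recurse(0, len(scores), m, 0))
```

With flat scores the sketch keeps splitting until the maximal depth and so falls back to an evenly spread selection, while a sharp peak in the scores stops the recursion early and concentrates the keyframes on that moment. With `m = 8` and `max_depth = 3`, the recursion can reach the same four levels (Level-0 to Level-3) shown in Figure 3.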