FOCUS: Efficient Keyframe Selection for Long Video Understanding

Zirui Zhu, Hailun Xu, Yang Luo, Yong Liu, Kanchan Sarkar, Zhenheng Yang, Yang You

TL;DR

Long videos pose token-budget challenges for multimodal LLMs. We introduce FOCUS, a training-free keyframe selector that casts frame selection as a combinatorial pure-exploration problem in a multi-armed bandit, with a two-stage, batched exploration to locate informative temporal regions and choose top frames within them. By employing clip-level arms, empirical means, and Bernstein confidence radii, FOCUS achieves high-utility frame subsets under strict budgets and is shown to improve QA accuracy on LongVideoBench and Video-MME across multiple backbones while using less than 2% of frames. The approach is modular, scalable, and accompanied by reproducibility resources (code at GitHub), offering a practical path to scalable long-video understanding with multimodal large language models.

Abstract

Multimodal large language models (MLLMs) represent images and video frames as visual tokens. Scaling from single images to hour-long videos, however, inflates the token budget far beyond practical limits. Popular pipelines therefore either uniformly subsample or apply keyframe selection with retrieval-style scoring using smaller vision-language models. These keyframe selection methods still rely on pre-filtering before selection to reduce the inference cost, and can therefore miss the most informative moments. We propose FOCUS, Frame-Optimistic Confidence Upper-bound Selection, a training-free, model-agnostic keyframe selection module that selects query-relevant frames under a strict token budget. FOCUS formulates keyframe selection as a combinatorial pure-exploration (CPE) problem in multi-armed bandits: it treats short temporal clips as arms, and uses empirical means and Bernstein confidence radii to identify informative regions while preserving exploration of uncertain areas. The resulting two-stage exploration-exploitation procedure is derived from a sequential policy with theoretical guarantees: it first identifies high-value temporal regions, then selects top-scoring frames within each region. On two long-video question-answering benchmarks, FOCUS delivers substantial accuracy improvements while processing less than 2% of video frames. For videos longer than 20 minutes, it achieves an 11.9% gain in accuracy on LongVideoBench, demonstrating its effectiveness as a keyframe selection method and providing a simple and general solution for scalable long-video understanding with MLLMs. Code is available at https://github.com/NUS-HPC-AI-Lab/FOCUS.
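
The two-stage idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see the repository above for that). The function names, the parameters (`clip_len`, `top_arms`, `rounds`, `batch`), and the scoring callback `score_fn` are all hypothetical. The confidence radius uses a standard empirical-Bernstein form, which is assumed here and may differ from the paper's definition. Scoring every frame of the selected clips in stage 2 is likewise a simplification.

```python
import numpy as np


def bernstein_radius(var, pulls, total, value_range=1.0):
    # Standard empirical-Bernstein form (an assumption, not the paper's exact definition).
    log_term = np.log(max(total, 2))
    return np.sqrt(2.0 * var * log_term / pulls) + 3.0 * value_range * log_term / pulls


def select_keyframes(score_fn, num_frames, clip_len, k, top_arms, rounds=4, batch=2, seed=0):
    """Return (sorted keyframe indices, number of frames scored). score_fn is hypothetical."""
    rng = np.random.default_rng(seed)
    # Arms are contiguous clips of frame indices.
    arms = [np.arange(s, min(s + clip_len, num_frames)) for s in range(0, num_frames, clip_len)]
    scores = {}  # cache: frame index -> relevance score in [0, 1]

    def pull(arm, n):
        # Score up to n not-yet-scored frames sampled uniformly from this clip.
        unseen = [int(f) for f in arm if int(f) not in scores]
        if not unseen:
            return
        for f in rng.choice(unseen, size=min(n, len(unseen)), replace=False):
            scores[int(f)] = float(score_fn(int(f)))

    def stats(arm):
        vals = np.array([scores[int(f)] for f in arm if int(f) in scores])
        return vals.mean(), vals.var(), len(vals)

    # Stage 1: initial pulls on every arm, then batched optimistic exploration.
    for arm in arms:
        pull(arm, batch)
    for _ in range(rounds):
        total = len(scores)
        ucb = []
        for arm in arms:
            mean, var, n = stats(arm)
            ucb.append(mean + bernstein_radius(var, n, total))
        for i in np.argsort(ucb)[::-1][:top_arms]:
            pull(arms[i], batch)

    # Stage 2: keep the arms with the highest empirical mean,
    # score them fully, and take the top-k frames among them.
    means = [stats(arm)[0] for arm in arms]
    best = np.argsort(means)[::-1][:top_arms]
    for i in best:
        pull(arms[i], len(arms[i]))
    candidates = [int(f) for i in best for f in arms[i]]
    candidates.sort(key=lambda f: scores[f], reverse=True)
    return sorted(candidates[:k]), len(scores)
```

In practice `score_fn` would wrap a query-frame relevance model, for example a small vision-language model of the kind the abstract mentions. The second return value shows how many frames were actually scored, which is the cost the batched exploration is meant to keep small.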

Paper Structure

This paper contains 48 sections, 2 theorems, 19 equations, 4 figures, 9 tables, 2 algorithms.

Key Result

Theorem B.1

Let $N_a(n)$ be the number of pulls for arm $a$ at round $n$, and let $n = \sum_{a\in\mathcal{A}} N_a(n)$ be the total number of pulls. Let $\hat{\mu}_a(n)$ and $\hat{\sigma}_a^2(n)$ be the empirical mean and the empirical variance of arm $a$ at round $n$. We define the empirical B Then the following bound holds with probability at least $1-\frac{6}{n}$:

Figures (4)

  • Figure 1: Temporal autocorrelation (ACF) of per-frame query relevance on LongVideoBench and Video-MME. We compute frame-level relevance per video and take the ACF over time lags (seconds); solid lines show the median across videos and shaded bands the interquartile range. The dashed line marks the correlation half-life level ($\rho(\delta)=0.5$).
  • Figure 2: Overview of FOCUS. FOCUS partitions videos into fixed-length clips that serve as bandit arms, applies optimistic confidence upper-bound arm selection, and selects the final keyframes within each promising arm.
  • Figure 3: Comparison between uniformly sampled frames and those selected by FOCUS. The left column shows two examples from LongVideoBench; the right column shows two from Video-MME. Yellow stars indicate manually annotated frames that are most informative to the query, many of which are successfully captured by FOCUS.
  • Figure 4: Two representative failure modes of LLaVA-Video-7B when using FOCUS to select keyframes. Yellow stars mark manually annotated frames that are most informative for the query. In the first case, FOCUS correctly selects these frames, but the MLLM still fails to answer due to its limited ability to reason over the relatively complex chart. In the second case, FOCUS fails to capture the critical frames during a compact, rapid scene transition: the relevant segment lasts only 1-2 seconds within a 10-minute video, making the keyframes difficult to identify even for human experts.
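
The analysis behind Figure 1 (autocorrelation of per-frame relevance, with the half-life read off at $\rho(\delta)=0.5$) can be sketched for a single video as follows. The function names and the `fps` parameter are assumptions made for illustration. The paper's exact estimator is not given in this summary, and the median and interquartile band across videos are not reproduced here.

```python
import numpy as np


def autocorrelation(relevance, max_lag):
    """Sample autocorrelation of a 1-D relevance series for lags 0..max_lag."""
    x = np.asarray(relevance, dtype=float)
    x = x - x.mean()
    denom = np.dot(x, x)
    if denom == 0.0:
        # A constant series has no variance; treat it as perfectly correlated.
        return np.ones(max_lag + 1)
    return np.array(
        [np.dot(x[: len(x) - lag], x[lag:]) / denom for lag in range(max_lag + 1)]
    )


def half_life(acf, fps=1.0):
    """First lag (in seconds) at which the ACF drops to 0.5 or below, else None."""
    below = np.nonzero(acf <= 0.5)[0]
    return None if len(below) == 0 else below[0] / fps
```

A long half-life means that relevance changes slowly over time, so neighbouring frames carry similar information; that is the kind of temporal structure that makes clip-level arms a reasonable unit of exploration.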

Theorems & Definitions (4)

  • Theorem B.1
  • proof
  • Theorem C.1
  • proof