FOCUS: Efficient Keyframe Selection for Long Video Understanding

Zirui Zhu; Hailun Xu; Yang Luo; Yong Liu; Kanchan Sarkar; Zhenheng Yang; Yang You

TL;DR

Long videos strain the token budget of multimodal LLMs. We introduce FOCUS, a training-free keyframe selector that casts frame selection as a combinatorial pure-exploration problem in multi-armed bandits. It uses a two-stage, batched exploration procedure to locate informative temporal regions and then choose the top frames within them. Using clip-level arms, empirical means, and Bernstein confidence radii, FOCUS selects high-utility frame subsets under strict budgets and improves QA accuracy on LongVideoBench and Video-MME across multiple backbones while processing less than 2% of frames. The approach is modular and model-agnostic, and the code is publicly available on GitHub, offering a practical path to scalable long-video understanding with multimodal large language models.

Abstract

Multimodal large language models (MLLMs) represent images and video frames as visual tokens. Scaling from single images to hour-long videos, however, inflates the token budget far beyond practical limits. Popular pipelines therefore either subsample frames uniformly or apply keyframe selection with retrieval-style scoring from smaller vision-language models. Yet these keyframe selection methods still rely on pre-filtering before selection to reduce inference cost, and they can miss the most informative moments. We propose FOCUS (Frame-Optimistic Confidence Upper-bound Selection), a training-free, model-agnostic keyframe selection module that selects query-relevant frames under a strict token budget. FOCUS formulates keyframe selection as a combinatorial pure-exploration (CPE) problem in multi-armed bandits: it treats short temporal clips as arms and uses empirical means and Bernstein confidence radii to identify informative regions while preserving exploration of uncertain areas. The resulting two-stage exploration-exploitation procedure is derived by reduction from a sequential policy with theoretical guarantees: it first identifies high-value temporal regions, then selects the top-scoring frames within each region. On two long-video question-answering benchmarks, FOCUS delivers substantial accuracy improvements while processing less than 2% of video frames. For videos longer than 20 minutes, it achieves an 11.9% gain in accuracy on LongVideoBench, demonstrating its effectiveness as a keyframe selection method and providing a simple, general solution for scalable long-video understanding with MLLMs. Code is available at https://github.com/NUS-HPC-AI-Lab/FOCUS.
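
To make the two-stage idea in the abstract concrete, the following is a minimal sketch and not the authors' implementation. It assumes a hypothetical per-frame relevance scorer `score_fn` returning values in [0, 1] (in practice this might be a small vision-language model scoring a frame against the query). The function names, the default parameters, and the specific empirical-Bernstein constants are all illustrative assumptions; the released code may differ in every one of these choices.

```python
import math
import random
from statistics import mean, pvariance


def bernstein_radius(values, delta=0.05, value_range=1.0):
    """Empirical-Bernstein-style confidence radius for scores in [0, value_range].

    The constants follow one common textbook form and are an assumption here.
    """
    n = len(values)
    log_term = math.log(3.0 / delta)
    variance_part = math.sqrt(2.0 * pvariance(values) * log_term / n)
    range_part = 3.0 * value_range * log_term / n
    return variance_part + range_part


def select_keyframes(num_frames, score_fn, clip_len=32, probes_per_clip=4,
                     num_clips_kept=4, budget=8, seed=0):
    """Hypothetical two-stage selector: pick clips optimistically, then frames."""
    rng = random.Random(seed)

    # Arms are consecutive, non-overlapping clips of frame indices.
    clips = [list(range(start, min(start + clip_len, num_frames)))
             for start in range(0, num_frames, clip_len)]

    # Stage 1: probe a few frames per clip in one batch and compute an
    # optimistic upper bound (empirical mean + confidence radius) per clip.
    upper_bounds = []
    for clip in clips:
        probes = rng.sample(clip, min(probes_per_clip, len(clip)))
        values = [score_fn(i) for i in probes]
        upper_bounds.append(mean(values) + bernstein_radius(values))

    kept = sorted(range(len(clips)), key=lambda a: upper_bounds[a],
                  reverse=True)[:num_clips_kept]

    # Stage 2: score every frame in the kept clips and take the top frames
    # from each clip, splitting the frame budget evenly across clips.
    per_clip = max(1, budget // len(kept))
    chosen = []
    for a in kept:
        ranked = sorted(clips[a], key=score_fn, reverse=True)
        chosen.extend(ranked[:per_clip])

    # Return the selected frame indices in temporal order.
    return sorted(chosen[:budget])
```

With only a handful of probes per clip the range term of the radius dominates, so clips with few observations stay optimistic. That is the intended effect of an upper-confidence rule, although a real system would tune the probe count and the confidence level `delta`.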
