Agentic Keyframe Search for Video Question Answering

Sunqi Fan; Meng-Hao Guo; Shuojin Yang

TL;DR

This work addresses efficient video understanding for VideoQA with Agentic Keyframe Search (AKeyS), a language-agent-guided, tree-structured search that retrieves keyframes and discards redundant frames. The search is modeled as adaptive node expansion driven by a cost function, such as $f(n)=g(n)+h(n)$ for A* (with variants for GBFS and Dijkstra), and a base LLM reasons about what information is still missing in order to guide frame selection. On EgoSchema and NExT-QA, AKeyS achieves state-of-the-art accuracy in a training-free setting while processing fewer frames than prior methods: on the EgoSchema subset it uses only 43.5% of the frames that VideoTree processes. The method combines interpretable reasoning with selective visual processing, a step toward efficient, agentic video analytics.
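For reference, the three classical algorithms named above differ only in how they rank a node $n$ for expansion. The block below gives their standard textbook definitions, assuming $g(n)$ denotes the accumulated movement cost from the root to $n$ and $h(n)$ the heuristic estimate of how far $n$ is from the information still needed. How AKeyS computes each term is not specified in this summary.

$$
f_{\text{A*}}(n) = g(n) + h(n), \qquad f_{\text{GBFS}}(n) = h(n), \qquad f_{\text{Dijkstra}}(n) = g(n)
$$

In all three cases the node with the smallest $f(n)$ is expanded next.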

Abstract

Video question answering (VideoQA) enables machines to extract and comprehend key information from videos through natural language interaction, which is a critical step towards achieving intelligence. However, the need for thorough video understanding and the high computational cost still limit the widespread application of VideoQA. To address these limitations, we propose Agentic Keyframe Search (AKeyS), a simple yet powerful algorithm for identifying keyframes in the VideoQA task. It effectively distinguishes key information from redundant, irrelevant content by using modern language agents to direct classical search algorithms. Specifically, we first segment the video and organize it as a tree structure. Then, AKeyS uses a language agent to estimate heuristics and movement costs while dynamically expanding nodes. Finally, the agent determines whether sufficient keyframes have been collected based on termination conditions and provides answers. Extensive experiments on the EgoSchema and NExT-QA datasets show that AKeyS outperforms all previous methods with the highest keyframe search efficiency, meaning that it can accurately identify key information and conduct effective visual reasoning with minimal computational overhead. For example, on the EgoSchema subset, it achieves 1.8% higher accuracy while processing only 43.5% of the frames compared to VideoTree. We believe that AKeyS represents a significant step towards building intelligent agents for video understanding. The code is publicly available at https://github.com/fansunqi/AKeyS.
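The three steps described in the abstract (segment the video into a tree, expand nodes by estimated cost, stop when enough keyframes are collected) can be illustrated with a minimal sketch. This is not the authors' implementation: the `Segment` class, the binary split, the midpoint keyframe, and the callbacks `heuristic`, `step_cost`, and `is_sufficient` are all hypothetical names, and the language agent is replaced by toy functions that already know the target frame.

```python
import heapq
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Segment:
    """A tree node covering frames [start, end). Hypothetical structure."""
    start: int
    end: int

    def keyframe(self) -> int:
        # Assumption: the middle frame represents the segment.
        return (self.start + self.end) // 2

    def split(self, branching: int = 2) -> List["Segment"]:
        # Divide the segment into roughly equal child segments.
        length = self.end - self.start
        if length <= 1:
            return []
        k = min(branching, length)
        bounds = [self.start + i * length // k for i in range(k + 1)]
        return [Segment(a, b) for a, b in zip(bounds, bounds[1:])]


def keyframe_search(
    num_frames: int,
    heuristic: Callable[[Segment], float],
    step_cost: Callable[[Segment, Segment], float],
    is_sufficient: Callable[[List[int]], bool],
    mode: str = "astar",
    max_expansions: int = 50,
) -> List[int]:
    """Best-first search over the segment tree; returns selected frame indices."""
    counter = 0  # unique tie-breaker so Segment objects are never compared
    frontier = [(0.0, counter, 0.0, Segment(0, num_frames))]
    selected: List[int] = []
    while frontier and len(selected) < max_expansions:
        _, _, g, node = heapq.heappop(frontier)
        selected.append(node.keyframe())
        if is_sufficient(selected):  # termination condition
            break
        for child in node.split():
            g_child = g + step_cost(node, child)
            h_child = heuristic(child)
            priority = {
                "astar": g_child + h_child,
                "gbfs": h_child,
                "dijkstra": g_child,
            }[mode]
            counter += 1
            heapq.heappush(frontier, (priority, counter, g_child, child))
    return sorted(set(selected))


# Toy stand-ins for the language agent (assumption: the relevant frame is known).
TARGET = 50
toy_heuristic = lambda seg: abs(seg.keyframe() - TARGET)
toy_step_cost = lambda parent, child: 1.0
toy_sufficient = lambda frames: any(abs(f - TARGET) <= 1 for f in frames)

frames = keyframe_search(64, toy_heuristic, toy_step_cost, toy_sufficient, mode="gbfs")
print(frames)
```

In a real system the three callbacks would be backed by a language agent reasoning over frame captions or visual features. The point of the sketch is only that the search inspects a small subset of the 64 frames before the termination condition fires.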
