Agentic Keyframe Search for Video Question Answering

Sunqi Fan, Meng-Hao Guo, Shuojin Yang

TL;DR

This work tackles efficient video understanding for VideoQA by introducing Agentic Keyframe Search (AKeyS), a language-agent-guided, tree-structured search that retrieves keyframes while discarding redundant frames. By modeling the search as adaptive node expansion with cost functions such as $f(n)=g(n)+h(n)$ for A* (and variants for GBFS and Dijkstra), AKeyS uses a base LLM to reason about missing information and guide frame selection. Evaluations on EgoSchema and NExT-QA show that AKeyS achieves state-of-the-art accuracy in a training-free setting while processing only a fraction of frames, demonstrating superior frame efficiency compared to prior methods like VideoTree. This method advances practical video understanding by combining interpretable reasoning with selective visual processing, paving the way for efficient, agentic video analytics.
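For reference, the textbook priority functions of the three classical searches named above differ only in which term they keep. This summary does not specify how the AKeyS language agent estimates $g(n)$ and $h(n)$, so what follows is the standard formulation, not necessarily the paper's exact definition:

$$
f_{\text{A*}}(n) = g(n) + h(n), \qquad f_{\text{GBFS}}(n) = h(n), \qquad f_{\text{Dijkstra}}(n) = g(n)
$$

Here $g(n)$ is the accumulated cost of the path from the root to node $n$, and $h(n)$ is the estimated remaining cost from $n$ to a goal.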

Abstract

Video question answering (VideoQA) enables machines to extract and comprehend key information from videos through natural language interaction, which is a critical step towards achieving intelligence. However, the demand for a thorough understanding of videos and the high computational cost still limit the widespread application of VideoQA. To address these limitations, we propose Agentic Keyframe Search (AKeyS), a simple yet powerful algorithm for identifying keyframes in the VideoQA task. It can effectively distinguish key information from redundant, irrelevant content by leveraging modern language agents to direct classical search algorithms. Specifically, we first segment the video and organize it as a tree structure. Then, AKeyS uses a language agent to estimate heuristics and movement costs while dynamically expanding nodes. Finally, the agent determines whether sufficient keyframes have been collected based on termination conditions and provides answers. Extensive experiments on the EgoSchema and NExT-QA datasets show that AKeyS outperforms all previous methods with the highest keyframe searching efficiency, which means it can accurately identify key information and conduct effective visual reasoning with minimal computational overhead. For example, on the EgoSchema subset, it achieves 1.8% higher accuracy than VideoTree while processing only 43.5% of the frames. We believe that AKeyS represents a significant step towards building intelligent agents for video understanding. The code is publicly available at https://github.com/fansunqi/AKeyS.
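The three steps in the abstract (build a segment tree, expand nodes by an agent-estimated cost, stop when enough keyframes are collected) can be sketched as a generic best-first search. This is a minimal illustration under stated assumptions, not the authors' implementation: the binary split, the midpoint keyframe, and the names `Segment`, `keyframe_search`, `heuristic`, `step_cost`, and `is_sufficient` are all hypothetical. In AKeyS the heuristic, movement cost, and termination check come from a language agent, whereas here they are plain callables supplied by the caller.

```python
import heapq
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Segment:
    """A contiguous span of frame indices (inclusive); one node of the tree."""
    start: int
    end: int

    def keyframe(self) -> int:
        # Assumption: the middle frame represents the segment.
        return (self.start + self.end) // 2

    def children(self) -> List["Segment"]:
        # Assumption: binary split; single-frame segments are leaves.
        if self.start == self.end:
            return []
        mid = (self.start + self.end) // 2
        return [Segment(self.start, mid), Segment(mid + 1, self.end)]

# Standard priority functions for the three classical searches.
PRIORITY: Dict[str, Callable[[float, float], float]] = {
    "astar": lambda g, h: g + h,
    "gbfs": lambda g, h: h,
    "dijkstra": lambda g, h: g,
}

def keyframe_search(
    num_frames: int,
    heuristic: Callable[[Segment], float],           # stand-in for the agent's h(n)
    step_cost: Callable[[Segment, Segment], float],  # stand-in for the agent's movement cost
    is_sufficient: Callable[[List[int]], bool],      # stand-in for the agent's stop check
    mode: str = "astar",
    max_frames: int = 16,
) -> List[int]:
    """Best-first expansion over the segment tree; returns visited keyframes, sorted."""
    priority = PRIORITY[mode]
    root = Segment(0, num_frames - 1)
    counter = 0  # unique tie-breaker so the heap never compares Segment objects
    frontier = [(priority(0.0, heuristic(root)), counter, 0.0, root)]
    visited: List[int] = []
    while frontier and len(visited) < max_frames:
        _, _, g, node = heapq.heappop(frontier)
        frame = node.keyframe()
        if frame not in visited:
            visited.append(frame)
        if is_sufficient(visited):
            break
        for child in node.children():
            g_child = g + step_cost(node, child)
            counter += 1
            heapq.heappush(
                frontier, (priority(g_child, heuristic(child)), counter, g_child, child)
            )
    return sorted(visited)
```

With a toy heuristic that measures the distance from a segment to one known target frame, the informed modes (`"astar"`, `"gbfs"`) reach the target after inspecting fewer frames than `"dijkstra"`, which ignores the heuristic. That is the kind of frame saving the abstract describes, though on a far simpler problem.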

Paper Structure

This paper contains 16 sections, 4 figures, 5 tables, 2 algorithms.

Figures (4)

  • Figure 1: Demonstration of AKeyS's high frame efficiency. When processing the same number of video frames with the same (M)LLM, AKeyS achieves higher QA accuracy. At the same accuracy level (66%), AKeyS uses only about 1/4 of the frames required by VideoTree. Moreover, VideoTree clusters features of all frames during preprocessing, whereas AKeyS only has access to visible frames and does not utilize information from the rest. This experiment is conducted on the EgoSchema subset (Mangalam et al., 2023).
  • Figure 2: Comparison of three methods for analyzing a travel vlog: (1) a Video-LLM can generate correct answers but is highly token-intensive; (2) uniform frame sampling may introduce irrelevant content, leading the MLLM to incorrect predictions; (3) keyframe sampling for the MLLM achieves both accuracy and efficiency. The keyframes relevant to the given question are highlighted in the figure.
  • Figure 3: Illustration of AKeyS's cost function evaluation and node expansion steps.
  • Figure 4: Visualization of the tree-search process for a case from EgoSchema (Mangalam et al., 2023).