FALCONEye: Finding Answers and Localizing Content in ONE-hour-long videos with multi-modal LLMs
Carlos Plou, Cesar Borja, Ruben Martinez-Cantin, Ana C. Murillo
TL;DR
FALCONEye introduces a training-free, model-agnostic meta-architecture that fuses a Vision-Language Model and a Large Language Model to find and localize answers in hour-long videos through calibrated, confidence-guided exploration. The approach operates on short video clips, progressively zooming into higher-resolution subclips while the LLM reasons over captions and semantics to select promising segments, ultimately outputting an answer together with its temporal localization. To evaluate this long-form, open-ended setting, the authors propose FALCON-Bench, a challenging benchmark with one-hour videos, four question categories, and ground-truth temporal windows, along with GPT-assisted evaluation for open-ended questions. Across experiments, FALCONEye with a 7B VLM and a lightweight LLM outperforms comparable open-source baselines and scales favorably in cost, while generalizing to shorter videos and broader VQA tasks, suggesting practical applicability for scalable long-form video reasoning and search.
Abstract
Finding information in hour-long videos is a challenging task even for top-performing Vision Language Models (VLMs), as encoding visual content quickly exceeds available context windows. To tackle this challenge, we present FALCONEye, a novel video agent based on a training-free, model-agnostic meta-architecture composed of a VLM and a Large Language Model (LLM). FALCONEye answers open-ended questions using an exploration-based search algorithm guided by calibrated confidence from the VLM's answers. We also introduce the FALCON-Bench benchmark, which extends the Question Answering problem to Video Answer Search: models must return both the answer and its supporting temporal window for open-ended questions in hour-long videos. With just a 7B VLM and a lightweight LLM, FALCONEye outscores all open-source 7B VLMs and comparable agents on FALCON-Bench. It further demonstrates its generalization capability on the MLVU benchmark, which features shorter videos and different tasks, surpassing GPT-4o on single-detail tasks while slashing inference cost by roughly an order of magnitude.
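To make the idea of confidence-guided exploration concrete, the following is a minimal, illustrative sketch and not the authors' implementation. It assumes two hypothetical callables: `vlm_answer`, which returns an answer and a calibrated confidence for a clip, and `llm_rank`, which orders candidate clips by how promising they look. The threshold, split factor, per-level budget, and the toy stand-ins for both models are likewise assumptions chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Clip = Tuple[float, float]  # (start_seconds, end_seconds)


@dataclass
class Result:
    answer: str
    confidence: float
    window: Clip


def split(clip: Clip, parts: int) -> List[Clip]:
    """Divide a clip into `parts` equal, contiguous subclips."""
    start, end = clip
    step = (end - start) / parts
    return [(start + i * step, start + (i + 1) * step) for i in range(parts)]


def search(
    video: Clip,
    question: str,
    vlm_answer: Callable[[str, Clip], Tuple[str, float]],
    llm_rank: Callable[[str, List[Clip]], List[Clip]],
    threshold: float = 0.9,   # assumed confidence needed to stop
    parts: int = 4,           # assumed split factor per level
    budget: int = 3,          # assumed number of clips explored per level
    min_len: float = 10.0,    # assumed shortest clip worth splitting
) -> Optional[Result]:
    """Explore clips coarse-to-fine until the VLM is confident enough."""
    best: Optional[Result] = None
    frontier = split(video, parts)
    while frontier:
        next_frontier: List[Clip] = []
        # The LLM chooses which clips deserve a closer look.
        for clip in llm_rank(question, frontier)[:budget]:
            answer, conf = vlm_answer(question, clip)
            if best is None or conf > best.confidence:
                best = Result(answer, conf, clip)
            if conf >= threshold:
                return best  # confident answer plus its temporal window
            if clip[1] - clip[0] > min_len:
                next_frontier.extend(split(clip, parts))  # zoom in
        frontier = next_frontier
    return best  # fall back to the most confident answer seen


# Toy stand-ins for the two models; the answer is visible around t = 2000 s.
TARGET = 2000.0


def toy_vlm(question: str, clip: Clip) -> Tuple[str, float]:
    """Confident only when the clip is short and contains the target."""
    start, end = clip
    if start <= TARGET <= end:
        return ("red car", 0.95 if end - start <= 60 else 0.3)
    return ("unknown", 0.1)


def toy_llm_rank(question: str, clips: List[Clip]) -> List[Clip]:
    """Rank clips by distance to the target, mimicking caption relevance."""
    return sorted(clips, key=lambda c: abs((c[0] + c[1]) / 2 - TARGET))


result = search((0.0, 3600.0), "What color is the car?", toy_vlm, toy_llm_rank)
print(result)
```

In this toy run the search narrows a one-hour video to a clip of under a minute that contains the target timestamp. In a real system the ranking step would rely on clip captions and the confidence would come from the VLM's calibrated output, neither of which this sketch models.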
