FALCONEye: Finding Answers and Localizing Content in ONE-hour-long videos with multi-modal LLMs

Carlos Plou, Cesar Borja, Ruben Martinez-Cantin, Ana C. Murillo

TL;DR

FALCONEye introduces a training-free, model-agnostic meta-architecture that combines a Vision-Language Model and a Large Language Model to find and localize answers in hour-long videos through exploration guided by calibrated confidence. The approach operates on short video clips, progressively zooming into higher-resolution subclips while the LLM reasons over captions and question semantics to select promising segments, and finally outputs an answer together with its temporal localization. To evaluate this long-form, open-ended setting, the authors propose FALCON-Bench, a challenging benchmark with one-hour videos, four question categories, and ground-truth temporal windows, along with GPT-assisted evaluation for open-ended questions. Across experiments, FALCONEye with a 7B VLM and a lightweight LLM outperforms comparable open-source baselines and scales favorably in cost, while generalizing to shorter videos and broader VQA tasks, suggesting practical applicability to scalable long-form video reasoning and search.

Abstract

Finding information in hour-long videos is a challenging task even for top-performing Vision Language Models (VLMs), as encoding visual content quickly exceeds available context windows. To tackle this challenge, we present FALCONEye, a novel video agent based on a training-free, model-agnostic meta-architecture composed of a VLM and a Large Language Model (LLM). FALCONEye answers open-ended questions using an exploration-based search algorithm guided by calibrated confidence from the VLM's answers. We also introduce the FALCON-Bench benchmark, which extends the Question Answering problem to Video Answer Search: models must return both the answer and its supporting temporal window for open-ended questions in hour-long videos. With just a 7B VLM and a lightweight LLM, FALCONEye outscores all open-source 7B VLMs and comparable agents on FALCON-Bench. It further demonstrates its generalization capability on the MLVU benchmark, which features shorter videos and different tasks, surpassing GPT-4o on single-detail tasks while slashing inference cost by roughly an order of magnitude.

Paper Structure

This paper contains 36 sections, 4 equations, 11 figures, 13 tables, 1 algorithm.

Figures (11)

  • Figure 1: Overview of our meta-architecture FALCONEye, designed for Finding Answers and Localizing Content in ONE-hour-long videos with multi-modal LLMs. A VLM pre-processes the video, generating captions from short video clips. An LLM iteratively refines the search by using the captions to focus on the most promising clips. The exploration is hierarchical: initially, the VLM analyzes a broad set of low-resolution frames for each promising clip, then progressively narrows down to fewer frames at higher resolution as the search concentrates on shorter temporal clips. The LLM uses captions, question semantics, answer completion, and confidence scores to determine whether to continue or end the exploration. Once a high-confidence answer is found, or the search reaches a predefined threshold, FALCONEye outputs the final answer, its confidence score, and the corresponding temporal interval.
  • Figure 2: FALCONEye exploration algorithm. Given a question (Q) and a video, it starts with a global Pre-processing of the video, where the VLM generates short captions from video clips. Afterwards, the LLM Reasons over Q and the captions to select candidate clips to explore. The VLM Evaluates the candidate clips, sampling frames whose number and resolution vary according to the clip duration. This yields an answer and its confidence (Conf.) for each candidate clip. From the answer and confidence obtained for every candidate clip, the LLM Decides whether the final answer has already been found or whether to continue exploring, in one of two ways: if some candidate clips are still promising and long enough, they are segmented into new, shorter candidate clips to evaluate. Otherwise, the LLM is asked to reason again with the captions of the already explored clips removed.
  • Figure 3: FALCON-Bench question examples for each dataset.
  • Figure 4: Distribution of questions in FALCON-Bench. The left plot shows the distribution by dataset source: MovieChat-1k, SoccerNet, and WalkingTours. The right plot shows the distribution by question category: TR, VO, TI, and OI.
  • Figure 5: Visualization of the GToU metric, designed to evaluate localization/retrieval of the clip that contains the answer.
  • ...and 6 more figures
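
The exploration loop described in the Figure 2 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `select_clips` (the LLM "Reason" step) and `evaluate_clip` (the VLM "Evaluate" step) are hypothetical callables supplied by the caller, the "Decide" step is replaced here by a plain confidence threshold instead of an LLM call, and the parameters `conf_threshold`, `promising_conf`, `min_clip_s`, and `max_evals` are illustrative names and values that do not come from the paper.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

Clip = Tuple[float, float]  # (start_seconds, end_seconds)


@dataclass
class Candidate:
    clip: Clip
    answer: str
    confidence: float


def split_clip(clip: Clip, parts: int = 2) -> List[Clip]:
    """Segment a clip into `parts` equal, shorter sub-clips."""
    start, end = clip
    step = (end - start) / parts
    return [(start + i * step, start + (i + 1) * step) for i in range(parts)]


def explore(
    question: str,
    captions: Dict[Clip, str],
    select_clips: Callable[[str, Dict[Clip, str]], List[Clip]],  # hypothetical LLM "Reason"
    evaluate_clip: Callable[[str, Clip], Tuple[str, float]],     # hypothetical VLM "Evaluate"
    conf_threshold: float = 0.9,   # assumed stopping confidence
    promising_conf: float = 0.5,   # assumed "still promising" cutoff
    min_clip_s: float = 10.0,      # assumed minimum duration worth splitting
    max_evals: int = 50,           # assumed exploration budget
) -> Optional[Candidate]:
    remaining = dict(captions)     # captions of clips not yet explored
    best: Optional[Candidate] = None
    evals = 0
    while remaining and evals < max_evals:
        # Reason: pick candidate clips from the remaining captions.
        frontier = select_clips(question, remaining)
        if not frontier:
            break
        # Remove the captions of clips that are about to be explored.
        for clip in frontier:
            remaining.pop(clip, None)
        while frontier and evals < max_evals:
            # Evaluate: one (answer, confidence) pair per candidate clip.
            results: List[Candidate] = []
            for clip in frontier:
                if evals >= max_evals:
                    break
                answer, conf = evaluate_clip(question, clip)
                evals += 1
                results.append(Candidate(clip, answer, conf))
            for cand in results:
                if best is None or cand.confidence > best.confidence:
                    best = cand
            # Decide (simplified to a threshold rule in this sketch).
            if best is not None and best.confidence >= conf_threshold:
                return best
            # Zoom in: split clips that are still promising and long enough.
            frontier = [
                sub
                for cand in results
                if cand.confidence >= promising_conf
                and (cand.clip[1] - cand.clip[0]) > min_clip_s
                for sub in split_clip(cand.clip)
            ]
    # Budget exhausted or nothing left to explore: return the best answer so far.
    return best
```

The returned `Candidate` carries the answer, its confidence, and the temporal interval it came from, mirroring the three outputs named in the Figure 1 caption. In a real system, `evaluate_clip` would also choose the number and resolution of sampled frames from the clip duration; that detail is left to the caller here.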