FocusGraph: Graph-Structured Frame Selection for Embodied Long Video Question Answering

Tatiana Zemskova, Solomon Andryushenko, Ilya Obrubov, Viktoriia Khoruzhaia, Ekaterina Eroshenko, Ekaterina Derevyanka, Dmitry Yudin

TL;DR

FocusGraph is a framework for keyframe selection in question answering over long egocentric videos. It combines a lightweight trainable Scene-Caption LLM Selector, which selects query-relevant clips based on their graph-based captions, with a training-free method for selecting keyframes from those clips.

Abstract

The ability to understand long videos is vital for embodied intelligent agents, because their effectiveness depends on how well they can accumulate, organize, and leverage long-horizon perceptual memories. Multimodal LLMs (MLLMs) have recently gained popularity for long video understanding thanks to their general ability to understand natural language and leverage world knowledge. However, as the number of frames provided to an MLLM increases, the quality of its responses tends to degrade and inference time grows. Selecting keyframes from the video is therefore a crucial step when using MLLMs to answer user queries about long videos. In this work, we develop FocusGraph, a framework for keyframe selection for question answering over long egocentric videos. It leverages a lightweight trainable Scene-Caption LLM Selector that selects query-relevant clips based on their graph-based captions, and a training-free method for selecting keyframes from these clips. Unlike existing methods, the proposed Scene-Caption LLM Selector does not rely on the original sequence of low-resolution frames; instead, it operates on a compact textual representation of the scene. We then design a training-free Patch-wise Sparse-Flow Retention (PSFR) method to select keyframes from the resulting sequence of clips; the selected keyframes are fed into an MLLM to produce the final answer. Together, these components enable FocusGraph to achieve state-of-the-art results on challenging egocentric long-video question answering benchmarks, including FindingDory and HourVideo, while significantly reducing inference time relative to baseline approaches.

Paper Structure

This paper contains 12 sections, 7 equations, 3 figures, and 5 tables.

Figures (3)

  • Figure 1: We propose FocusGraph, a modular method that addresses navigation and question-answering tasks on long videos using hierarchical textual scene graphs and a fast, training-free keyframe selection method called PSFR.
  • Figure 2: FocusGraph overview. FocusGraph takes egocentric video as input and splits it into clips with a fixed number of frames. Then, for each clip, a pretrained MLLM constructs an object-centric representation of the scene in the form of a hierarchical textual scene graph, containing a list of objects and a scene description. We also store the time range of the original video from which each clip is extracted. Next, a time-augmented clip caption is generated from each graph-based clip description, which is then projected into the Scene-Caption LLM Selector. The Scene-Caption LLM Selector selects the clips that contain the answer to the question. From the selected clips, K keyframes are chosen using the proposed training-free PSFR algorithm. The K keyframes are then fed into an MLLM, which uses them to solve various tasks, such as navigation goal selection (FindingDory) or multi-choice question answering (HourVideo).
  • Figure 3: Comparison of frame sampling methods (FindingDory: episode 2, task 49). Green-bordered frames are present in the ground truth; red-bordered frames are not.
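
The data flow described in the Figure 2 caption can be illustrated with a minimal Python sketch. It is not the authors' implementation. The names `Clip`, `split_into_clips`, `time_augmented_caption`, and `uniform_keyframes` are hypothetical, and so is the caption string format. PSFR itself is not specified in this section, so plain uniform sampling stands in for it here. The MLLM that builds scene graphs and the Scene-Caption LLM Selector are not modeled; the selected clip indices are supplied by hand.

```python
from dataclasses import dataclass
from typing import List, Sequence, Set, Tuple


@dataclass
class Clip:
    """Hypothetical container for one fixed-length clip of the source video."""
    index: int
    frame_range: Tuple[int, int]     # [start, end) frame indices
    time_range: Tuple[float, float]  # (start, end) in seconds


def split_into_clips(num_frames: int, fps: float, frames_per_clip: int) -> List[Clip]:
    """Split a video into clips with a fixed number of frames, keeping time ranges."""
    clips = []
    for i, start in enumerate(range(0, num_frames, frames_per_clip)):
        end = min(start + frames_per_clip, num_frames)  # last clip may be shorter
        clips.append(Clip(i, (start, end), (start / fps, end / fps)))
    return clips


def time_augmented_caption(clip: Clip, objects: Sequence[str], description: str) -> str:
    """Build a textual clip caption from an object list and a scene description.

    The string layout is an assumption made for illustration only.
    """
    t0, t1 = clip.time_range
    return f"[{t0:.1f}s-{t1:.1f}s] objects: {', '.join(objects)}. scene: {description}"


def uniform_keyframes(clips: Sequence[Clip], selected_ids: Set[int], k: int) -> List[int]:
    """Pick k frame indices from the selected clips by uniform sampling.

    This is a stand-in for PSFR, NOT the PSFR algorithm.
    """
    candidates = [f for c in clips if c.index in selected_ids
                  for f in range(*c.frame_range)]
    if k >= len(candidates):
        return candidates
    step = len(candidates) / k
    return [candidates[int(i * step)] for i in range(k)]


clips = split_into_clips(num_frames=10, fps=2.0, frames_per_clip=4)
caption = time_augmented_caption(clips[1], ["mug", "table"], "a kitchen counter")
keyframes = uniform_keyframes(clips, selected_ids={0, 2}, k=3)
```

In a real pipeline, the captions would be passed to the selector together with the question, and the returned clip indices would replace the hand-picked `selected_ids`.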