Understanding Long Videos via LLM-Powered Entity Relation Graphs

Meng Chu, Yicong Li, Tat-Seng Chua

TL;DR

GraphVideoAgent tackles long-form video understanding by integrating a dynamic video knowledge graph with an LLM-powered reasoning agent. The approach captures how entity relations evolve over time and uses a graph-guided frame retrieval strategy to answer questions efficiently. Empirical results on EgoSchema and NExT-QA show state-of-the-art accuracy while analyzing only about 8 frames on average (8.2 and 8.1, respectively), so the gain in accuracy does not come at the cost of processing more frames. This work advances structured semantic memory for video understanding and points toward real-time, multi-modal extensions.

Abstract

The analysis of extended video content poses unique challenges in artificial intelligence, particularly when dealing with the complexity of tracking and understanding visual elements across time. Current methodologies that process video frames sequentially struggle to maintain coherent tracking of objects, especially when these objects temporarily vanish and later reappear in the footage. A critical limitation of these approaches is their inability to effectively identify crucial moments in the video, largely due to their limited grasp of temporal relationships. To overcome these obstacles, we present GraphVideoAgent, a cutting-edge system that leverages the power of graph-based object tracking in conjunction with large language model capabilities. At its core, our framework employs a dynamic graph structure that maps and monitors the evolving relationships between visual entities throughout the video sequence. This innovative approach enables more nuanced understanding of how objects interact and transform over time, facilitating improved frame selection through comprehensive contextual awareness. Our approach demonstrates remarkable effectiveness when tested against industry benchmarks. In evaluations on the EgoSchema dataset, GraphVideoAgent achieved a 2.2 improvement over existing methods while requiring analysis of only 8.2 frames on average. Similarly, testing on the NExT-QA benchmark yielded a 2.0 performance increase with an average frame requirement of 8.1. These results underscore the efficiency of our graph-guided methodology in enhancing both accuracy and computational performance in long-form video understanding tasks.
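
To make the idea of a dynamic entity-relation graph concrete, the following is a minimal sketch, not the authors' implementation. The class name `EntityRelationGraph`, its methods, and the dog/toy/person observations are all hypothetical and invented for illustration. It only shows how a memory keyed by entity can keep track of an object that leaves the view and later reappears, and how relations between a pair of entities can be read back in temporal order.

```python
from collections import defaultdict


class EntityRelationGraph:
    """Toy entity-relation memory (hypothetical; not the paper's code)."""

    def __init__(self):
        # entity name -> set of frame indices where it was observed
        self.appearances = defaultdict(set)
        # (subject, object) -> list of (frame index, relation label)
        self.relations = defaultdict(list)

    def observe(self, frame, entities, triples=()):
        """Record the entities and (subject, label, object) triples seen in one frame."""
        for name in entities:
            self.appearances[name].add(frame)
        for subj, label, obj in triples:
            self.relations[(subj, obj)].append((frame, label))

    def frames_with(self, *entities):
        """Return the sorted frames in which every named entity was observed."""
        sets = [self.appearances.get(e, set()) for e in entities]
        return sorted(set.intersection(*sets)) if sets else []

    def relation_history(self, subj, obj):
        """Return the relations between two entities in temporal order."""
        return sorted(self.relations.get((subj, obj), []))


# Invented observations: the toy is out of view at frame 40 and reappears at 90.
graph = EntityRelationGraph()
graph.observe(0, ["dog", "toy"], [("dog", "plays with", "toy")])
graph.observe(40, ["dog"])
graph.observe(90, ["dog", "toy", "person"], [("person", "takes", "toy")])
graph.observe(120, ["dog", "person"], [("dog", "growls at", "person")])

print(graph.frames_with("dog", "toy"))
print(graph.relation_history("dog", "person"))
```

Because appearances are stored per entity rather than per frame sequence, the toy's absence at frame 40 does not break its identity: both of its sightings stay attached to the same node.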

Paper Structure

This paper contains 15 sections, 4 equations, 8 figures, 3 tables.

Figures (8)

  • Figure 1: Paradigm comparison. (a) The traditional method employs a frame selector with sequential memory, which processes frames linearly and outputs "The dog shows its angry face towards the person," missing the causal relationship. (b) In contrast, our method combines an LLM Agent with Graph Memory, representing entities (i.e., Dog, Toy, Person) and their interactions through a structured graph.
  • Figure 2: The figure illustrates GraphVideoAgent's architecture, which consists of four main components: (1) an input module that performs uniform sampling from long videos, (2) a dynamic entity-relation graph that tracks entities and their temporal relations, (3) foundation model tools including CLIP, VLM, and frame retrieval for processing video content, and (4) an LLM agent responsible for frame selection, graph updates, and answer generation. These components work together to enable graph-enhanced video understanding capabilities.
  • Figure 3: GraphVideoAgent's video analysis process has multiple components: a sequence of 8 video frames showing interactions between people in an indoor setting (top), a Graph Entity Memory structure (bottom left) that maps relations between entities (people, objects, and actions) and tracks their appearances across frames, and a reasoning process (bottom right) that uses this graph structure to answer questions about the video. The system includes a multiple-choice question interface, graph searching capabilities, and self-reflection mechanisms to evaluate answer confidence. The graph maintains entity relations, state changes, and temporal information to enable accurate video understanding and question answering.
  • Figure 4: LLM ablation.
  • Figure 5: Graph Component Ablation.
  • ...and 3 more figures
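
The captions for Figures 2 and 3 describe a loop of uniform sampling, graph updates, a confidence check, and graph-guided frame retrieval. The sketch below illustrates that control flow only, under loud assumptions: `entities_at`, `judge`, and `retrieve` are hypothetical stubs standing in for the VLM captioner, the LLM agent's self-reflection, and the CLIP-based retrieval, and the midpoint heuristic in `retrieve` is invented for this toy example rather than taken from the paper.

```python
def uniform_sample(num_frames, k):
    """Pick k roughly evenly spaced frame indices from num_frames frames."""
    if k >= num_frames:
        return list(range(num_frames))
    step = num_frames / k
    return [int(step * i + step / 2) for i in range(k)]


def answer_question(num_frames, caption, judge, retrieve, initial_k=5, max_rounds=3):
    """Iterate: update graph memory, check confidence, fetch more frames."""
    frames = uniform_sample(num_frames, initial_k)
    graph = {}   # entity -> list of frame indices (stand-in for graph memory)
    seen = []    # every frame inspected so far
    answer = "unknown"
    for _ in range(max_rounds + 1):
        for f in frames:
            for entity in caption(f):
                graph.setdefault(entity, []).append(f)
            seen.append(f)
        answer, confident = judge(graph)
        if confident:
            break
        frames = [f for f in retrieve(graph) if f not in seen]
        if not frames:
            break
    return answer, sorted(seen)


# --- Hypothetical stubs for a 180-frame toy video -------------------------
def entities_at(frame):
    """Stub captioner: the toy is visible early, the person arrives later."""
    present = ["dog"]
    if frame <= 75:
        present.append("toy")
    if frame >= 70:
        present.append("person")
    return present


def judge(graph):
    """Stub self-reflection: confident once toy and person share a frame."""
    together = set(graph.get("toy", [])) & set(graph.get("person", []))
    if together:
        return "person took the toy", True
    return "unknown", False


def retrieve(graph):
    """Stub retrieval: probe between the last toy and first person sighting."""
    toy, person = graph.get("toy"), graph.get("person")
    if not toy or not person:
        return []
    return [(max(toy) + min(person)) // 2]


answer, used = answer_question(180, entities_at, judge, retrieve)
print(answer, used)
```

In this toy run the five uniformly sampled frames never show the toy and the person together, so the graph is used to pick one additional frame between their sightings, and the loop stops after inspecting six frames in total. The point is the shape of the loop, in which the graph narrows where to look next; the specific heuristics are placeholders.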