Understanding Long Videos via LLM-Powered Entity Relation Graphs

Meng Chu; Yicong Li; Tat-Seng Chua

TL;DR

GraphVideoAgent tackles the problem of long-form video understanding by integrating a dynamic video knowledge graph with an LLM-powered reasoning agent. The approach captures evolving entity relations across time and uses a graph-guided frame retrieval strategy to answer questions efficiently. Empirical results on EgoSchema and NExT-QA show state-of-the-art accuracy with only around 8 frames of analysis on average, highlighting both accuracy and computational efficiency. This work advances structured semantic memory for video understanding and points toward real-time, multi-modal extensions.

Abstract

The analysis of extended video content poses unique challenges in artificial intelligence, particularly when dealing with the complexity of tracking and understanding visual elements across time. Current methodologies that process video frames sequentially struggle to maintain coherent tracking of objects, especially when these objects temporarily vanish and later reappear in the footage. A critical limitation of these approaches is their inability to effectively identify crucial moments in the video, largely due to their limited grasp of temporal relationships. To overcome these obstacles, we present GraphVideoAgent, a cutting-edge system that leverages the power of graph-based object tracking in conjunction with large language model capabilities. At its core, our framework employs a dynamic graph structure that maps and monitors the evolving relationships between visual entities throughout the video sequence. This innovative approach enables more nuanced understanding of how objects interact and transform over time, facilitating improved frame selection through comprehensive contextual awareness. Our approach demonstrates remarkable effectiveness when tested against industry benchmarks. In evaluations on the EgoSchema dataset, GraphVideoAgent achieved a 2.2 improvement over existing methods while requiring analysis of only 8.2 frames on average. Similarly, testing on the NExT-QA benchmark yielded a 2.0 performance increase with an average frame requirement of 8.1. These results underscore the efficiency of our graph-guided methodology in enhancing both accuracy and computational performance in long-form video understanding tasks.
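The abstract describes a dynamic graph that tracks entities and their relations over time and uses that graph to pick a small number of frames. The sketch below is one possible illustration of that idea and is not the authors' implementation: the class `EntityGraph`, its methods, and the scoring rule (one point per query entity sighted in a frame, two extra points when a relation between two query entities is recorded there) are all hypothetical choices made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class EntityGraph:
    """Toy entity relation graph. All names are hypothetical, not from the paper."""

    # entity name -> frame indices where the entity was observed
    sightings: dict = field(default_factory=lambda: defaultdict(list))
    # (entity_a, entity_b), sorted -> list of (frame index, relation label)
    relations: dict = field(default_factory=lambda: defaultdict(list))

    def observe(self, frame, entities, relations=()):
        """Record the entities and (subject, label, object) relations seen in one frame."""
        for name in entities:
            self.sightings[name].append(frame)
        for subject, label, obj in relations:
            key = tuple(sorted((subject, obj)))
            self.relations[key].append((frame, label))

    def candidate_frames(self, query_entities, budget=8):
        """Return up to `budget` frame indices, in temporal order, ranked by an assumed score."""
        query = set(query_entities)
        scores = defaultdict(int)
        # One point for each query entity visible in a frame.
        for name in query:
            for frame in self.sightings.get(name, []):
                scores[frame] += 1
        # Two extra points when two query entities are related in a frame.
        for (a, b), events in self.relations.items():
            if a in query and b in query:
                for frame, _label in events:
                    scores[frame] += 2
        # Highest score first; ties broken by earlier frame.
        ranked = sorted(scores, key=lambda f: (-scores[f], f))
        return sorted(ranked[:budget])


graph = EntityGraph()
graph.observe(0, ["person", "cup"])
graph.observe(40, ["person"])
graph.observe(90, ["person", "cup"], [("person", "holds", "cup")])
graph.observe(120, ["cup"])  # the cup stays in view after the person leaves
print(graph.candidate_frames(["person", "cup"], budget=2))
```

Because sightings are stored per entity rather than per consecutive frame, an entity that disappears and later reappears is still linked to all of its frames, which is the property the abstract attributes to the graph-based tracking. In the described system, an LLM agent would presumably drive which entities are queried; here the query is supplied by hand.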
