EgoGraph: Temporal Knowledge Graph for Egocentric Video Understanding

Shitong Sun, Ke Han, Yukai Huang, Weitong Cai, Jifei Song

TL;DR

EgoGraph is a training-free, dynamic knowledge-graph construction framework that explicitly encodes long-term, cross-entity dependencies in egocentric video streams. Its temporal relational modeling strategy captures temporal dependencies across entities and accumulates stable long-term memory over multiple days, enabling complex temporal reasoning.

Abstract

Ultra-long egocentric videos spanning multiple days present significant challenges for video understanding. Existing approaches still rely on fragmented local processing and limited temporal modeling, restricting their ability to reason over such extended sequences. To address these limitations, we introduce EgoGraph, a training-free and dynamic knowledge-graph construction framework that explicitly encodes long-term, cross-entity dependencies in egocentric video streams. EgoGraph employs a novel egocentric schema that unifies the extraction and abstraction of core entities, such as people, objects, locations, and events, and structurally reasons about their attributes and interactions, yielding a significantly richer and more coherent semantic representation than traditional clip-based video models. Crucially, we develop a temporal relational modeling strategy that captures temporal dependencies across entities and accumulates stable long-term memory over multiple days, enabling complex temporal reasoning. Extensive experiments on the EgoLifeQA and EgoR1-bench benchmarks demonstrate that EgoGraph achieves state-of-the-art performance on long-term video question answering, validating its effectiveness as a new paradigm for ultra-long egocentric video understanding.
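To make the idea of an entity-centric, time-stamped graph concrete, here is a minimal sketch. It is not the authors' implementation: only the four entity types come from the abstract, while the `Relation` and `TemporalGraph` classes, their fields, and the example facts are hypothetical names and data chosen for illustration.

```python
from dataclasses import dataclass

# Entity types named in the abstract; everything else below is an assumption.
ENTITY_TYPES = {"person", "object", "location", "event"}


@dataclass(frozen=True)
class Relation:
    """A time-stamped edge between two entities (hypothetical structure)."""
    subject: str
    predicate: str
    obj: str
    day: int      # recording day, starting at 1
    time_s: int   # seconds since the start of that day


class TemporalGraph:
    """Accumulates entities and time-stamped relations across days."""

    def __init__(self):
        self.entity_types = {}   # entity name -> entity type
        self.by_entity = {}      # entity name -> relations touching it

    def add_entity(self, name, etype):
        if etype not in ENTITY_TYPES:
            raise ValueError(f"unknown entity type: {etype}")
        self.entity_types[name] = etype

    def add_relation(self, rel):
        # Index the relation under both endpoints so either can be tracked.
        for name in {rel.subject, rel.obj}:
            if name not in self.entity_types:
                raise KeyError(f"unknown entity: {name}")
            self.by_entity.setdefault(name, []).append(rel)

    def history(self, name):
        """All relations involving `name`, in chronological order."""
        return sorted(self.by_entity.get(name, []),
                      key=lambda r: (r.day, r.time_s))


graph = TemporalGraph()
graph.add_entity("Alice", "person")
graph.add_entity("mug", "object")
graph.add_entity("kitchen", "location")

# Facts arrive out of order and on different days; the graph keeps them all.
graph.add_relation(Relation("Alice", "uses", "mug", day=2, time_s=30000))
graph.add_relation(Relation("mug", "located_in", "kitchen", day=1, time_s=28800))

mug_history = graph.history("mug")
```

Indexing each relation under both of its endpoints is what lets a single lookup follow one entity across temporally separated events, which clip-level summaries cannot do without rereading every clip.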

Paper Structure

This paper contains 16 sections, 6 equations, 5 figures, and 4 tables.

Figures (5)

  • Figure 1: Comparison between hierarchical and graph-based methods for ultra-long egocentric video understanding. The hierarchical method summarizes each clip-level segment independently, whereas our graph-based approach constructs an entity-centric memory that models long-term dependencies across temporally separated events.
  • Figure 2: The pipeline of our proposed EgoGraph. EgoGraph encodes an ultra-long video into a temporally aware event knowledge graph, imitating how human memory processes events.
  • Figure 3: Quantitative comparison on temporal reasoning tasks. EgoGraph outperforms EgoGPT across temporal aggregation, temporal dependency, and entity tracking, achieving an average improvement of 29.3%.
  • Figure 4: Qualitative comparison of question answering on multi-day egocentric videos. EgoGPT lacks explicit temporal grounding and incorrectly infers activities. EgoGraph retrieves temporally-filtered knowledge graph relations with specific timestamps, enabling accurate answers about daily routines.
  • Figure 5: Robustness analysis of EgoGraph on long-term video understanding. (Top) scalability with growing context and (Bottom) temporal gap sensitivity.
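The temporally filtered retrieval mentioned in the Figure 4 caption can be illustrated with a short sketch. This is an assumption-laden toy, not the paper's retrieval procedure: the tuple layout, the `filter_by_time_of_day` helper, and all of the relations and dates below are invented for illustration.

```python
from datetime import datetime, time

# Hypothetical time-stamped relations: (subject, predicate, object, timestamp).
relations = [
    ("wearer", "drinks", "coffee", datetime(2024, 1, 1, 8, 15)),
    ("wearer", "drinks", "coffee", datetime(2024, 1, 2, 8, 40)),
    ("wearer", "drinks", "tea", datetime(2024, 1, 2, 15, 5)),
    ("wearer", "drinks", "coffee", datetime(2024, 1, 3, 8, 5)),
]


def filter_by_time_of_day(rels, start, end, predicate=None):
    """Keep relations whose time of day lies in [start, end)."""
    return [
        r for r in rels
        if start <= r[3].time() < end
        and (predicate is None or r[1] == predicate)
    ]


# A routine question such as "What does the wearer drink in the morning?"
# becomes a time-of-day filter followed by aggregation over days.
morning = filter_by_time_of_day(relations, time(6, 0), time(12, 0),
                                predicate="drinks")
coffee_days = sorted({r[3].date().isoformat()
                      for r in morning if r[2] == "coffee"})
```

Because every retrieved relation carries its own timestamp, an answer about a daily routine can cite the specific days that support it rather than relying on an untimed summary.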