Building a Mind Palace: Structuring Environment-Grounded Semantic Graphs for Effective Long Video Analysis with LLMs

Zeyi Huang, Yuyang Ji, Xiaofang Wang, Nikhil Mehta, Tong Xiao, Donghyun Lee, Sigmund Vanvalkenburgh, Shengxin Zha, Bolin Lai, Licheng Yu, Ning Zhang, Yong Jae Lee, Miao Liu

TL;DR

VideoMindPalace introduces a mind-palace-inspired solution for long-form video understanding by structuring content as a three-layer hierarchical graph that anchors temporally dispersed moments to spatial zones and environment layouts. The framework combines human–object interaction graphs, zone discovery via CLIP and camera pose, and scene-layout graphs, with JSON representations that are directly consumable by LLMs to enable grounded spatio-temporal reasoning. A new Video MindPalace Benchmark (VMB) challenges models with spatial localization, temporal reasoning, and layout-aware questions on egocentric videos, complemented by comprehensive experiments on EgoSchema, NExT-QA, IntentQA, AMB, and VMB. Results show state-of-the-art performance across multiple benchmarks, especially on long-form videos, demonstrating improved coherence and human-aligned reasoning while highlighting the importance of structured, interpretable representations for enabling effective LLM-based video analysis in real-world settings.

Abstract

Long-form video understanding with Large Vision-Language Models (VLMs) is challenged by the need to analyze temporally dispersed yet spatially concentrated key moments within limited context windows. In this work, we introduce VideoMindPalace, a new framework inspired by the "Mind Palace", which organizes critical video moments into a topologically structured semantic graph. VideoMindPalace organizes key information through (i) hand-object tracking and interaction, (ii) clustered activity zones representing specific areas of recurring activities, and (iii) environment layout mapping, allowing natural language parsing by LLMs to provide grounded insights on spatio-temporal and 3D context. In addition, we propose the Video MindPalace Benchmark (VMB) to assess human-like reasoning, including spatial localization, temporal reasoning, and layout-aware sequential understanding. Evaluated on VMB and established video QA datasets, including EgoSchema, NExT-QA, IntentQA, and the Active Memories Benchmark, VideoMindPalace demonstrates notable gains in spatio-temporal coherence and human-aligned reasoning, advancing long-form video analysis capabilities in VLMs.

Paper Structure (21 sections, 2 equations, 5 figures, 5 tables)

Figures (5)

  • Figure 1: (Right) Our VideoMindPalace represents video data as a layered, topologically structured graph, where nodes capture spatial concepts (e.g., objects, activity zones, rooms) and edges signify spatiotemporal relationships, layout relationships, and human-object interactions. This graph can be represented in JSON format and used as input to text-only LLMs. (Center) VideoTree (Wang et al., 2024) extracts query-relevant information by organizing videos as tree structures, with deeper branches capturing finer, query-specific details. A captioner then generates video descriptions from this structure, enabling the LLM to perform reasoning over long videos. (Left) LLoVi (Zhang et al., 2023) processes videos in temporal order: visual captioners sequentially generate textual descriptions within each temporal sliding window, and the LLM then aggregates them for reasoning.
  • Figure 2: Overview of our VideoMindPalace framework. 1) VideoMindPalace is a three-layered graph with nodes representing spatial concepts (e.g., objects, zones, rooms) and edges capturing spatiotemporal relationships. Layer 1 - Human and Object: Nodes represent the human and detected objects, with edges denoting spatiotemporal connections and interactions. Layer 2 - Activity Zones: Nodes represent specific activity zones, with edges showing 3D spatial relationships. Layer 3 - Scene Layout: Nodes represent rooms, with edges for relative distances. 2) This graph can be represented in JSON format and used as input to LLMs. The model’s responses are grounded in the physical scene, enabling it to identify locations, locate items of interest, and understand the topological structure of the space.
  • Figure 3: Qualitative results of VideoMindPalace on the VMB benchmark, with an example for each question type. To explore how VideoMindPalace successfully answers these questions, we prompt GPT-4 to identify the specific parts of the graph that provide sufficient information to answer each question accurately.
  • Figure 4: More qualitative results of VideoMindPalace on the VMB benchmark, showcasing examples for each question type. To demonstrate how VideoMindPalace effectively answers these questions, we leverage GPT-4 to pinpoint specific graph components that provide the necessary information for accurate responses.
  • Figure 5: Query distribution by video length and reasoning categories.
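
The three-layer graph described in the Figure 2 caption can be illustrated with a small Python sketch. This is only a toy example under stated assumptions: the key names (`layer1_human_object`, `zone`, `room`, `relative_distance`, and so on), the sample nodes, and the helper functions `build_graph`, `locate_object`, and `to_prompt` are all hypothetical and are not taken from the paper. The sketch shows only the general idea of serializing a layered graph to JSON and passing it to a text-only LLM.

```python
import json


def build_graph():
    """Return a toy three-layer graph. All field names are hypothetical."""
    return {
        # Layer 1: the human and detected objects, with interaction edges.
        "layer1_human_object": {
            "nodes": [
                {"id": "person", "type": "human"},
                {"id": "knife", "type": "object", "zone": "counter"},
                {"id": "mug", "type": "object", "zone": "sink"},
            ],
            "edges": [
                {"source": "person", "target": "knife",
                 "relation": "picks up", "t_start": 12.0, "t_end": 15.5},
                {"source": "person", "target": "mug",
                 "relation": "washes", "t_start": 40.0, "t_end": 52.0},
            ],
        },
        # Layer 2: activity zones, with spatial-relationship edges.
        "layer2_activity_zones": {
            "nodes": [
                {"id": "counter", "room": "kitchen"},
                {"id": "sink", "room": "kitchen"},
                {"id": "sofa", "room": "living_room"},
            ],
            "edges": [
                {"source": "counter", "target": "sink", "relation": "left of"},
            ],
        },
        # Layer 3: rooms, with relative-distance edges.
        "layer3_scene_layout": {
            "nodes": [{"id": "kitchen"}, {"id": "living_room"}],
            "edges": [
                {"source": "kitchen", "target": "living_room",
                 "relative_distance": "adjacent"},
            ],
        },
    }


def locate_object(graph, object_id):
    """Follow the object -> zone -> room links across the three layers."""
    objects = {n["id"]: n for n in graph["layer1_human_object"]["nodes"]}
    zones = {n["id"]: n for n in graph["layer2_activity_zones"]["nodes"]}
    zone_id = objects[object_id]["zone"]
    return zone_id, zones[zone_id]["room"]


def to_prompt(graph, question):
    """Serialize the graph to JSON and attach a question for a text-only LLM."""
    graph_json = json.dumps(graph, indent=2)
    return f"Scene graph (JSON):\n{graph_json}\n\nQuestion: {question}"


if __name__ == "__main__":
    g = build_graph()
    print(locate_object(g, "mug"))
    print(to_prompt(g, "Where was the mug last used?")[:200])
```

Linking each object to a zone and each zone to a room keeps spatial lookups cheap in this sketch; how the actual framework links its layers is not specified in the captions above.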