Enhancing Vision-Language Navigation with Multimodal Event Knowledge from Real-World Indoor Tour Videos

Haoxuan Xu, Tianfu Li, Wenbo Chen, Yi Liu, Xingxing Zuo, Yaoxian Song, Haoang Li

TL;DR

This work proposes an event-centric knowledge enhancement strategy, combining automated process knowledge mining with feature fusion, to address coarse-grained instructions and long-horizon reasoning in the VLN task. It builds YE-KG, a multimodal spatiotemporal knowledge graph derived from real-world indoor videos, and introduces STE-VLN, which integrates this graph into VLN models via a Coarse-to-Fine Hierarchical Retrieval mechanism.

Abstract

Vision-Language Navigation (VLN) agents often struggle with long-horizon reasoning in unseen environments, particularly when facing ambiguous, coarse-grained instructions. While recent advances use knowledge graphs to enhance reasoning, the potential of multimodal event knowledge inspired by human episodic memory remains underexplored. In this work, we propose an event-centric knowledge enhancement strategy for automated process knowledge mining and feature fusion to address coarse-grained instructions and long-horizon reasoning in the VLN task. First, we construct YE-KG, the first large-scale multimodal spatiotemporal knowledge graph, with over 86k nodes and 83k edges, derived from real-world indoor videos. By leveraging multimodal large language models (i.e., LLaVA and GPT-4), we convert unstructured video streams into structured semantic-action-effect events that serve as explicit episodic memory. Second, we introduce STE-VLN, which integrates this graph into VLN models via a Coarse-to-Fine Hierarchical Retrieval mechanism. This allows agents to retrieve causal event sequences and dynamically fuse them with egocentric visual observations. Experiments on the REVERIE, R2R, and R2R-CE benchmarks demonstrate the efficiency of our event-centric strategy, which outperforms state-of-the-art approaches across diverse action spaces. Our data and code are available on the project website https://sites.google.com/view/y-event-kg/.
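
The abstract does not spell out how the Coarse-to-Fine Hierarchical Retrieval works internally, so the following is only a minimal sketch of one plausible reading: a coarse stage shortlists events whose text descriptions match the instruction, and a fine stage re-ranks that shortlist against the current visual observation. The function name `coarse_to_fine_retrieve`, the event schema (`id`, `text_emb`, `video_emb`), the use of cosine similarity, and the toy two-dimensional embeddings are all assumptions made for illustration, not the authors' implementation.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def coarse_to_fine_retrieve(instr_emb, obs_emb, events, k_coarse=3, n_fine=2):
    """Return ids of the top-n events (hypothetical schema, see lead-in)."""
    # Coarse stage: shortlist events whose text embedding matches the instruction.
    coarse = sorted(events, key=lambda e: cosine(instr_emb, e["text_emb"]),
                    reverse=True)[:k_coarse]
    # Fine stage: re-rank the shortlist by similarity between each event's
    # video feature and the current egocentric observation.
    fine = sorted(coarse, key=lambda e: cosine(obs_emb, e["video_emb"]),
                  reverse=True)[:n_fine]
    return [e["id"] for e in fine]

# Toy event nodes with made-up embeddings.
events = [
    {"id": "enter_kitchen", "text_emb": [1.0, 0.0], "video_emb": [0.0, 1.0]},
    {"id": "walk_to_sink",  "text_emb": [0.9, 0.1], "video_emb": [1.0, 0.0]},
    {"id": "open_fridge",   "text_emb": [0.8, 0.2], "video_emb": [0.7, 0.7]},
    {"id": "climb_stairs",  "text_emb": [0.0, 1.0], "video_emb": [1.0, 0.0]},
]
result = coarse_to_fine_retrieve([1.0, 0.0], [1.0, 0.0], events)
print(result)  # ['walk_to_sink', 'open_fridge']
```

In this toy run, `climb_stairs` matches the observation perfectly but is never considered, because the coarse text stage has already filtered it out. That is the intended effect of a hierarchical scheme: the cheaper semantic filter bounds the set over which visual matching is performed.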

Paper Structure (31 sections, 10 equations, 5 figures, 6 tables)

Figures (5)

  • Figure 1: Various VLN modeling paradigms. (a) Conventional VLN methods design a planner based directly on instructions and visual observations [hong2021vln], often lacking a deep understanding of complex indoor environmental knowledge. (b) Previous knowledge-enhanced VLN models introduce entity knowledge embeddings [song2024scene] as priors but fail to effectively associate objects and scenes with dynamic navigation actions. (c) Our method constructs video-based event knowledge for the VLN planner. It empowers the robot with explicit knowledge priors that uniquely capture the spatiotemporal features linking objects, scenes, and navigation actions.
  • Figure 2: Overview of YouTube-Event-KG (YE-KG). (a) Multimodal Event Knowledge Graph Construction. We first collect large-scale real-world indoor tour videos from YouTube, followed by visual frame extraction and scene segmentation. Then, we employ LLaVA-NeXT-Video together with GPT-4 to generate semantically grounded event descriptions as nodes, connecting them via directed edges representing temporal adjacency (causal navigational links) to yield the YE-KG. (b) Data Statistics. The video duration distribution in the knowledge graph is tightly concentrated between 6 and 13 seconds, highlighting the corpus's high quality and temporal consistency. (c) YE-KG Visualization. An example event is visualized together with its associated video features, illustrating how event nodes are grounded in spatiotemporal visual context.
  • Figure 3: Overview of the STE-VLN framework enhanced by multimodal event knowledge. (a) The textual instruction and egocentric visual observation are fed into the VLN planner, enriched by external knowledge. The red dashed frame illustrates the coarse-to-fine retrieval process: $\{\hat{T}_{\text{event}}^{(i)}\}_{i=1}^n$ and $\{\hat{\mathcal{F}}_i\}_{i=1}^{n}$ denote the top-$n$ retrieved textual event descriptions and their corresponding video features from the YE-KG, which are fused with the original instruction $I$ and the current visual observation $f_t^v$, respectively. (b) The Knowledge-Guided Transformer is designed to implement ASTFF, where the egocentric observation serves as the Query (Q) and the retrieved video knowledge provides the Key (K) and Value (V).
  • Figure 4: Visualization of event-level knowledge-guided decision correction in unseen VLN scenarios. By retrieving related event sequences and visual cues via a coarse-to-fine mechanism, our method compensates for missing procedural priors and guides the agent to the target. The light orange arrows represent the trajectory of the GOAT [wang2024vision] baseline, which erroneously leads to a bathtub instead of the target sink.
  • Figure 5: Real-world deployment. The agent robustly follows natural language instructions in real office settings. (a) "I am thirsty, find the water for me": navigation to the water dispenser. (b) "Go to the sofa, then find the green box": localization of the sofa followed by retrieval of the green box. Top rows: first-person views; bottom rows: third-person views, illustrating seamless traversal of diverse functional areas.
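
The Query/Key/Value roles described in the Figure 3(b) caption can be illustrated with plain scaled dot-product cross-attention. The sketch below is not the paper's Knowledge-Guided Transformer: it assumes a single head, no learned projection matrices, and a simple residual addition for fusion, and the names `cross_attend`, `obs`, and `retrieved` are hypothetical. It only shows how an observation feature $f_t^v$ acting as the query can weight the retrieved video features $\{\hat{\mathcal{F}}_i\}$ acting as keys and values.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    d = len(query)
    # Similarity of the observation (query) to each retrieved feature (key).
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    weights = softmax(scores)
    # Weighted sum of the retrieved features (values).
    context = [sum(w * v[i] for w, v in zip(weights, values))
               for i in range(len(values[0]))]
    return context, weights

obs = [1.0, 0.0]                      # stand-in for the observation f_t^v
retrieved = [[1.0, 0.0], [0.0, 1.0]]  # stand-ins for retrieved video features
context, weights = cross_attend(obs, retrieved, retrieved)

# Assumed fusion rule: residual addition of the attended knowledge.
fused = [o + c for o, c in zip(obs, context)]
```

Here the first retrieved feature is aligned with the observation and therefore receives the larger attention weight, so the fused vector is pulled toward the knowledge most relevant to what the agent currently sees.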