Fine-Grained Instruction-Guided Graph Reasoning for Vision-and-Language Navigation

Yaohua Liu, Xinyuan Song, Yunfu Deng, Yifan Xie, Binkai Ou, Yan Zhong

TL;DR

We address Vision-and-Language Navigation (VLN), where misalignment between visual/angular cues and language instructions hinders navigation. We propose OIKG, a fine-grained instruction-guided graph reasoning framework. It pairs an observation–graph interaction module, which decouples angular and visual cues, with a fine-grained instruction guidance module, which aligns linguistic semantics with navigable trajectories. Theoretical results show that angular–visual decoupling with geometric embeddings reduces gradient variance and that location/object cue extraction increases mutual information with the ground-truth path. Empirically, OIKG achieves state-of-the-art performance on the R2R and RxR benchmarks, demonstrating improved spatial reasoning and cross-modal alignment in complex, long-horizon VLN tasks.

Abstract

Vision-and-Language Navigation (VLN) requires an embodied agent to traverse complex environments by following natural language instructions, demanding accurate alignment between visual observations and linguistic guidance. Despite recent progress, existing methods typically encode visual and directional cues in a coupled manner, and process instructions without explicitly extracting navigation-critical semantics, which often leads to imprecise spatial reasoning and suboptimal cross-modal alignment. To address these challenges, we propose a fine-grained instruction-guided graph reasoning framework (OIKG) that enhances both spatial representation and instruction understanding during navigation. Specifically, an observation-graph interaction mechanism is introduced to disentangle angular and visual cues while strengthening directed edge representations through geometric embedding, enabling more reliable spatial reasoning within the navigation graph. In addition, a fine-grained instruction guidance module is designed to explicitly extract and leverage location-specific and object-centric information from language instructions, facilitating more precise cross-modal alignment between linguistic semantics and navigable trajectories. By jointly integrating structured graph reasoning with instruction-critical semantic cues, the proposed approach significantly improves the agent's ability to follow complex navigation instructions. Extensive experiments on the R2R and RxR benchmarks demonstrate that our method consistently achieves state-of-the-art performance across multiple evaluation metrics, validating the effectiveness of fine-grained instruction-guided graph reasoning for vision-and-language navigation.
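
The idea of keeping angular cues separate from visual features can be illustrated with a minimal sketch. This summary does not give the paper's formulas, so everything here is an assumption: the sine/cosine encoding of heading and elevation is a common convention in VLN agents, and the names `angle_feature` and `decouple` are hypothetical helpers, not part of OIKG.

```python
import math
from typing import List, Tuple


def angle_feature(heading: float, elevation: float) -> List[float]:
    """Encode a relative view direction (radians) as [sin h, cos h, sin e, cos e].

    Assumed encoding: a common VLN convention, not taken from the paper.
    """
    return [
        math.sin(heading),
        math.cos(heading),
        math.sin(elevation),
        math.cos(elevation),
    ]


def decouple(view_feature: List[float], angle_dim: int = 4) -> Tuple[List[float], List[float]]:
    """Split a concatenated [visual | angular] vector into two separate streams.

    Hypothetical helper: it assumes the angular part occupies the last
    `angle_dim` entries of the coupled vector.
    """
    if len(view_feature) <= angle_dim:
        raise ValueError("feature vector is too short to contain both parts")
    return view_feature[:-angle_dim], view_feature[-angle_dim:]


# Toy usage: a 3-dimensional visual feature coupled with a direction that
# points 90 degrees to the right at eye level.
visual = [0.2, -0.1, 0.7]
coupled = visual + angle_feature(math.pi / 2, 0.0)
visual_part, angular_part = decouple(coupled)
```

Once the two parts are separated, each can be processed by its own projection before any interaction with graph features, which is the kind of disentanglement the abstract describes at a high level.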

Paper Structure

This paper contains 23 sections, 2 theorems, 18 equations, 4 figures, 3 tables.

Key Result

Theorem 1

Let $F'_o$ and $F'_g$ be as defined in Definitions 1, 2, and 3. Consider the navigation loss $\mathcal{L}$ jointly optimized with respect to $F'_o$ and $F'_g$. Then for some constant $\delta > 0$, where the expectation is taken over the training data distribution.

Figures (4)

  • Figure 1: Illustration of our proposed OIKG architecture. At time step $t$, given the observation $O_t$ and the path graph $G_t$, the observation features $F_o$ and graph features $F_g$ are extracted respectively, where the Observation-Graph Interaction module is employed to strengthen the edge representation and update the candidate nodes. Then, we design the Key-Detail Guidance module to further extract the detailed information from the text instruction $I$, and enhance the alignment between instruction and navigation path. Finally, a Candidate Selection module updates the path graph $G_{t+1}$ for the next step.
  • Figure 2: The process of extracting location and object details from the original instruction.
  • Figure 3: Qualitative results on the R2R Dataset. The green box indicates the location details, and the blue box represents the object details.
  • Figure 4: Comparison of sDTW on the RxR val seen and val unseen splits. OIKG achieves the highest sDTW on both splits.
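
For readers unfamiliar with the sDTW metric reported in Figure 4, the sketch below shows how success-weighted dynamic time warping is commonly computed: a normalized DTW score between the reference and predicted paths, set to zero when the agent does not stop within the success threshold. It is a simplified illustration, not the benchmark's evaluation code. In particular, it uses Euclidean distances between 3D points, whereas standard VLN evaluation uses shortest-path distances on the navigation graph, and the 3 m threshold is the usual default rather than a value stated in this summary.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def dtw(reference: Sequence[Point], query: Sequence[Point]) -> float:
    """Dynamic time warping cost between two paths (Euclidean point distance)."""
    n, m = len(reference), len(query)
    inf = float("inf")
    # cost[i][j] holds the best alignment cost of the first i reference
    # points with the first j query points.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(reference[i - 1], query[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def sdtw(reference: Sequence[Point], query: Sequence[Point], threshold: float = 3.0) -> float:
    """Success-weighted normalized DTW: nDTW if the episode succeeds, else 0."""
    # Success here means the final predicted position lies within `threshold`
    # of the final reference position (Euclidean distance is an assumption).
    if math.dist(reference[-1], query[-1]) >= threshold:
        return 0.0
    return math.exp(-dtw(reference, query) / (len(reference) * threshold))


# Toy usage: a straight reference path and a prediction with one detour.
reference_path = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
predicted_path = [(0.0, 0.0, 0.0), (1.0, 1.0, 0.0), (2.0, 0.0, 0.0)]
score = sdtw(reference_path, predicted_path)
```

A prediction identical to the reference path scores 1.0, and any prediction that ends outside the threshold scores 0.0, so the metric rewards both reaching the goal and staying close to the instructed route.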

Theorems & Definitions (5)

  • Definition 1: Observation Feature Decoupling
  • Definition 2: Angular Distance Embedding
  • Definition 3: Geometric Positional Embedding
  • Theorem 1: Effectiveness of Decoupling and Geometric Embedding
  • Theorem 2: Fine-Grained Information Gain and Alignment Accuracy of the Key-Detail Guidance Module