Fine-Grained Instruction-Guided Graph Reasoning for Vision-and-Language Navigation
Yaohua Liu, Xinyuan Song, Yunfu Deng, Yifan Xie, Binkai Ou, Yan Zhong
TL;DR
We address Vision-and-Language Navigation (VLN), where misalignment between visual and angular cues and the language instruction hinders navigation. We propose OIKG, a fine-grained instruction-guided graph reasoning framework with two components: an observation-graph interaction module that decouples angular and visual cues, and a fine-grained instruction guidance module that aligns linguistic semantics with navigable trajectories. Our theoretical analysis shows that angular-visual decoupling with geometric embeddings reduces gradient variance, and that extracting location and object cues increases mutual information with the ground-truth path. Empirically, OIKG achieves state-of-the-art performance on the R2R and RxR benchmarks, demonstrating improved spatial reasoning and cross-modal alignment in complex, long-horizon VLN tasks.
Abstract
Vision-and-Language Navigation (VLN) requires an embodied agent to traverse complex environments by following natural language instructions, demanding accurate alignment between visual observations and linguistic guidance. Despite recent progress, existing methods typically encode visual and directional cues in a coupled manner, and process instructions without explicitly extracting navigation-critical semantics, which often leads to imprecise spatial reasoning and suboptimal cross-modal alignment. To address these challenges, we propose a fine-grained instruction-guided graph reasoning framework (OIKG) that enhances both spatial representation and instruction understanding during navigation. Specifically, an observation-graph interaction mechanism is introduced to disentangle angular and visual cues while strengthening directed edge representations through geometric embedding, enabling more reliable spatial reasoning within the navigation graph. In addition, a fine-grained instruction guidance module is designed to explicitly extract and leverage location-specific and object-centric information from language instructions, facilitating more precise cross-modal alignment between linguistic semantics and navigable trajectories. By jointly integrating structured graph reasoning with instruction-critical semantic cues, the proposed approach significantly improves the agent's ability to follow complex navigation instructions. Extensive experiments on the R2R and RxR benchmarks demonstrate that our method consistently achieves state-of-the-art performance across multiple evaluation metrics, validating the effectiveness of fine-grained instruction-guided graph reasoning for vision-and-language navigation.
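To make the idea of disentangling angular and visual cues more concrete, the following is a minimal illustrative sketch and not the OIKG implementation. It assumes that each candidate view comes with a visual feature vector plus a heading and an elevation angle, and that graph nodes have 3-D positions. The names `angle_features`, `edge_geometry`, and `DecoupledEncoder`, the sine/cosine angle encoding, and the random linear projections are all hypothetical choices made for illustration.

```python
import numpy as np


def angle_features(heading, elevation):
    """Encode two angles (radians) as sine/cosine pairs (assumed encoding)."""
    return np.array([
        np.sin(heading), np.cos(heading),
        np.sin(elevation), np.cos(elevation),
    ])


def edge_geometry(src, dst):
    """Directed geometric features for the edge src -> dst.

    src and dst are hypothetical 3-D node positions (x, y, z).
    Reversing the edge flips the heading, so the two directions
    receive different features.
    """
    delta = np.asarray(dst, dtype=float) - np.asarray(src, dtype=float)
    distance = np.linalg.norm(delta)
    heading = np.arctan2(delta[0], delta[1])
    elevation = np.arctan2(delta[2], np.hypot(delta[0], delta[1]))
    return np.concatenate([angle_features(heading, elevation), [distance]])


class DecoupledEncoder:
    """Projects visual and angular cues through separate weights.

    A coupled encoder would concatenate both cues and apply a single
    projection; here each cue keeps its own (randomly initialised,
    untrained) projection matrix.
    """

    def __init__(self, visual_dim, hidden_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.w_visual = rng.normal(scale=0.02, size=(visual_dim, hidden_dim))
        self.w_angle = rng.normal(scale=0.02, size=(4, hidden_dim))

    def encode(self, visual, heading, elevation):
        visual_emb = np.asarray(visual, dtype=float) @ self.w_visual
        angle_emb = angle_features(heading, elevation) @ self.w_angle
        return visual_emb, angle_emb


if __name__ == "__main__":
    encoder = DecoupledEncoder(visual_dim=8, hidden_dim=4)
    visual_emb, angle_emb = encoder.encode(np.ones(8), heading=0.5, elevation=0.1)
    print(visual_emb.shape, angle_emb.shape)
    print(edge_geometry((0, 0, 0), (0, 1, 0)))
    print(edge_geometry((0, 1, 0), (0, 0, 0)))
```

In this sketch, changing only the heading leaves the visual embedding untouched, which is the sense in which the two cues are kept apart, and the two directions of the same edge yield different geometric features. How OIKG actually fuses these representations inside the navigation graph is not specified in the abstract and is not modelled here.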
