Landmark-Guided Knowledge for Vision-and-Language Navigation

Dongsheng Yang; Meiling Zhu; Yinfeng Yu

TL;DR

This work addresses vision-and-language navigation (VLN), where gaps in commonsense knowledge hinder instruction-grounded navigation in unseen environments. It introduces Landmark-Guided Knowledge (LGK), which combines a large knowledge base derived from Visual Genome with two components: a Knowledge-Guided by Landmark (KGL) mechanism that selects the knowledge most relevant to the landmarks in the instruction, and Knowledge-Guided Dynamic Augmentation (KGDA), which fuses language, knowledge, vision, and history. The approach uses CLIP-based knowledge matching and LXMERT-style cross-modal fusion to enrich environmental understanding and improve decision-making. It achieves state-of-the-art results on the R2R and REVERIE benchmarks, with improvements in navigation error, success rate, and path efficiency. Overall, LGK demonstrates the value of landmark-directed knowledge for multimodal reasoning in VLN and points toward more robust generalization in complex indoor environments.

Abstract

Vision-and-language navigation is one of the core tasks in embodied intelligence, requiring an agent to navigate autonomously in an unfamiliar environment by following natural language instructions. However, existing methods often fail to match instructions with environmental information in complex scenarios, in part because they lack common-sense reasoning ability. This paper proposes a vision-and-language navigation method called Landmark-Guided Knowledge (LGK), which introduces an external knowledge base to assist navigation and addresses the misjudgments that insufficient common sense causes in traditional methods. Specifically, we first construct a knowledge base containing 630,000 language descriptions and use knowledge matching to align environmental subviews with the knowledge base, extracting relevant descriptive knowledge. Next, we design a Knowledge-Guided by Landmark (KGL) mechanism, which uses the landmark information in the instructions to guide the agent toward the most relevant parts of the knowledge, thereby reducing the data bias that incorporating external knowledge may introduce. Finally, we propose Knowledge-Guided Dynamic Augmentation (KGDA), which integrates language, knowledge, vision, and historical information. Experimental results demonstrate that LGK outperforms existing state-of-the-art methods on the R2R and REVERIE vision-and-language navigation datasets, particularly in terms of navigation error, success rate, and path efficiency.
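To make the first two steps concrete, the following is a minimal sketch, not the authors' implementation. It assumes that subviews, knowledge descriptions, and instruction landmarks have already been embedded into a shared space (for example by a CLIP-style encoder). It then retrieves the most similar knowledge entries for a subview and reweights them by their similarity to a landmark. The function names, the toy 3-dimensional embeddings, the value of `k`, and the softmax temperature are all hypothetical choices made for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def match_knowledge(view_emb, kb_embs, k=2):
    """Knowledge matching (sketch): indices of the k entries most similar to a subview."""
    scores = [cosine(view_emb, e) for e in kb_embs]
    ranked = sorted(range(len(kb_embs)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]

def landmark_weights(landmark_emb, kb_embs, indices, temperature=0.1):
    """Landmark guidance (sketch): softmax over landmark-to-knowledge similarities."""
    logits = [cosine(landmark_emb, kb_embs[i]) / temperature for i in indices]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy data: hypothetical embeddings standing in for encoder outputs.
kb_texts = ["a lamp on a table", "a sink in a kitchen", "a bed with pillows"]
kb_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
view_emb = [0.9, 0.4, 0.1]      # embedding of one environmental subview
landmark_emb = [0.1, 1.0, 0.0]  # embedding of an instruction landmark, e.g. "kitchen"

top = match_knowledge(view_emb, kb_embs, k=2)
weights = landmark_weights(landmark_emb, kb_embs, top)
for i, w in zip(top, weights):
    print(f"{kb_texts[i]}: weight {w:.3f}")
```

In this toy case the subview retrieves the lamp and sink descriptions, and the landmark shifts most of the weight onto the kitchen-related entry. This illustrates how landmark guidance can downweight retrieved knowledge that is visually plausible but irrelevant to the instruction.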
