Landmark-Guided Knowledge for Vision-and-Language Navigation

Dongsheng Yang, Meiling Zhu, Yinfeng Yu

TL;DR

This work tackles vision-and-language navigation (VLN) by addressing the commonsense gaps that hinder instruction-grounded navigation in unseen environments. It introduces Landmark-Guided Knowledge (LGK), which combines a large Visual Genome-derived knowledge base with two mechanisms: Knowledge-Guided by Landmark (KGL), which uses landmarks in the instruction to select relevant knowledge, and Knowledge-Guided Dynamic Augmentation (KGDA), which fuses language, knowledge, vision, and history. The approach uses CLIP-based knowledge matching and LXMERT-style cross-modal fusion to enrich environmental understanding and improve decision-making, and reports state-of-the-art results on the R2R and REVERIE benchmarks, with improvements in navigation error, success rate, and path efficiency. Overall, LGK demonstrates the value of landmark-directed knowledge for multimodal reasoning in VLN and offers a path toward more robust generalization in complex indoor environments.

Abstract

Vision-and-language navigation is one of the core tasks in embodied intelligence, requiring an agent to autonomously navigate in an unfamiliar environment based on natural language instructions. However, existing methods often fail to match instructions with environmental information in complex scenarios, one reason being a lack of common-sense reasoning ability. This paper proposes a vision-and-language navigation method called Landmark-Guided Knowledge (LGK), which introduces an external knowledge base to assist navigation, addressing the misjudgments caused by insufficient common sense in traditional methods. Specifically, we first construct a knowledge base containing 630,000 language descriptions and use Knowledge Matching to align environmental subviews with the knowledge base, extracting relevant descriptive knowledge. Next, we design a Knowledge-Guided by Landmark (KGL) mechanism, which guides the agent to focus on the most relevant parts of the knowledge by leveraging landmark information in the instructions, thereby reducing the data bias that may arise from incorporating external knowledge. Finally, we propose Knowledge-Guided Dynamic Augmentation (KGDA), which effectively integrates language, knowledge, vision, and historical information. Experimental results demonstrate that the LGK method outperforms existing state-of-the-art methods on the R2R and REVERIE vision-and-language navigation datasets, particularly in terms of navigation error, success rate, and path efficiency.
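The abstract does not give the KGL formulation, so the following is only an illustrative sketch of what "focusing on the most relevant parts of the knowledge" could look like. It assumes a single landmark embedding acts as a query over a handful of retrieved knowledge embeddings and uses generic scaled dot-product attention as a stand-in. The function names, the toy 2-d vectors, and the choice of attention are all assumptions, not the authors' method.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def landmark_guided_knowledge(
    landmark: Sequence[float],
    knowledge: Sequence[Sequence[float]],
) -> Tuple[List[float], List[float]]:
    """Hypothetical helper: weight knowledge vectors by similarity to a landmark.

    Uses scaled dot-product attention (an assumption, not the paper's KGL).
    Returns the attention weights and the weighted sum of knowledge vectors.
    """
    scale = math.sqrt(len(landmark))
    scores = [
        sum(q * k for q, k in zip(landmark, vec)) / scale for vec in knowledge
    ]
    weights = softmax(scores)
    dim = len(knowledge[0])
    fused = [
        sum(w * vec[i] for w, vec in zip(weights, knowledge)) for i in range(dim)
    ]
    return weights, fused


# Toy vectors standing in for learned embeddings (illustrative only).
landmark_vec = [1.0, 0.0]
knowledge_vecs = [[2.0, 0.0], [0.0, 2.0]]
weights, fused = landmark_guided_knowledge(landmark_vec, knowledge_vecs)
```

In this toy case the first knowledge vector points in the same direction as the landmark, so it receives the larger weight and dominates the fused vector. Down-weighting knowledge unrelated to the instruction's landmarks is one way to read the abstract's claim about reducing bias from external knowledge.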

Paper Structure

This paper contains 24 sections, 3 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: The difference between the LGK method and other approaches. When the agent is confused by identical objects with different attributes, LGK assists navigation by introducing knowledge.
  • Figure 2: Overview of the proposed Landmark-Guided Knowledge Network (LGK) structure. The network incorporates an external knowledge base to assist the agent in navigation, focusing on three key components designed around the knowledge base: (1) Knowledge Matching, (2) Knowledge-Guided by Landmark, and (3) Knowledge-Guided Dynamic Augmentation.
  • Figure 3: Overview of Knowledge Matching, using CLIP to encode both environmental and knowledge information, followed by cosine similarity calculation.
  • Figure 4: (a) Word cloud of the REVERIE dataset and (b) word cloud of the knowledge base.
  • Figure 5: Knowledge Guided by Landmark.
  • ...and 2 more figures
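
The Figure 3 caption describes Knowledge Matching as encoding both the environment and the knowledge with CLIP and then computing cosine similarity. A minimal sketch of that retrieval step follows, assuming the embeddings have already been computed. The toy 3-d vectors stand in for CLIP features, and the function names, example descriptions, and `top_k` parameter are hypothetical rather than taken from the paper.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_knowledge(
    view_embedding: Sequence[float],
    knowledge_embeddings: Sequence[Sequence[float]],
    descriptions: Sequence[str],
    top_k: int = 2,
) -> List[Tuple[str, float]]:
    """Return the top_k descriptions most similar to one subview embedding."""
    scored = [
        (cosine_similarity(view_embedding, emb), text)
        for emb, text in zip(knowledge_embeddings, descriptions)
    ]
    # Sort by score only, highest first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(text, score) for score, text in scored[:top_k]]


# Toy data: in practice these would be CLIP image and text embeddings.
view = [0.9, 0.1, 0.0]
kb_embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]]
kb_texts = [
    "white sofa in the living room",
    "wooden dining table",
    "lamp next to the sofa",
]
matches = match_knowledge(view, kb_embeddings, kb_texts, top_k=2)
```

With these toy vectors the sofa description ranks first and the lamp description second. With 630,000 descriptions, a real system would more likely normalize the embeddings once and score them with a single matrix product than loop in Python.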