
OVER-NAV: Elevating Iterative Vision-and-Language Navigation with Open-Vocabulary Detection and StructurEd Representation

Ganlong Zhao, Guanbin Li, Weikai Chen, Yizhou Yu

TL;DR

OVER-NAV introduces a structured representation, termed Omnigraph, that integrates multi-modal information along the tour. It also provides reliable cross-modal supervision and enables on-the-fly generalization to unseen scenes without extra annotation or re-training.

Abstract

Recent advances in Iterative Vision-and-Language Navigation (IVLN) introduce a more meaningful and practical paradigm of VLN by maintaining the agent's memory across tours of scenes. Although long-term memory aligns better with the persistent nature of the VLN task, it raises the challenge of how to utilize highly unstructured navigation memory under extremely sparse supervision. To this end, we propose OVER-NAV, which aims to go over and beyond the current state of the art in IVLN. In particular, we propose to incorporate LLMs and open-vocabulary detectors to distill key information and establish correspondence between multi-modal signals. Such a mechanism introduces reliable cross-modal supervision and enables on-the-fly generalization to unseen scenes without the need for extra annotation or re-training. To fully exploit the interpreted navigation data, we further introduce a structured representation, termed Omnigraph, to effectively integrate multi-modal information along the tour. Together with a novel omnigraph fusion mechanism, OVER-NAV is able to extract the most relevant knowledge from the omnigraph for more accurate navigation actions. In addition, OVER-NAV seamlessly supports both discrete and continuous environments under a unified framework. We demonstrate the superiority of OVER-NAV in extensive experiments.

Paper Structure (20 sections, 2 equations, 5 figures, 7 tables)

Figures (5)

  • Figure 1: Top: Example of a two-episode tour. The agent first navigates the environment following the instruction of episode 1 (Yellow). Then the agent is directed to the ground-truth goal in the Oracle Goal phase (Red). Later the agent travels to the start point of episode 2 in the Oracle Start phase (Blue). Finally, the agent navigates the environment following the next instruction (Yellow). Bottom: Comparison between previous methods and ours. Closed-vocabulary methods require extra annotation and training effort to provide segmentation results during navigation, and the agent is limited to a closed set of categories when building segmentation maps. Our method proposes an open-vocabulary-based omnigraph, which is more flexible across various keywords and circumstances.
  • Figure 2: The overview of our proposed method. The instruction is sent to LLMs with the prompt to obtain keywords. The open-vocabulary detector receives the keywords and the panoramic view at the current position, and sends the detection results to the agent. With the detection results containing the distribution of detected objects, e.g., heading and confidence, the agent maintains the omnigraph that stores the information of visited viewpoints in previous episodes. Each viewpoint is tagged with keywords and distribution information. For inference, the omnigraph first collects the neighboring viewpoints and filters their keywords, then fuses the keywords with corresponding positional information, e.g., heading and confidence. Finally, the resulting positional keyword inputs are sent to the agent for prediction.
  • Figure 3: Visualization of the omnigraph in our method during a 100-episode tour. As the tour proceeds, the omnigraph grows, with more viewpoints and more connections. The keywords attached to viewpoints become more precise and diverse. We show at most three keywords per viewpoint and omit the extra information (e.g., heading) for simplicity.
  • Figure 4: The overview of our method combined with HAMT.
  • Figure 5: The overview of our method combined with MAP-CMA.
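
The omnigraph described in the Figure 2 caption (viewpoints tagged with keywords plus heading and confidence, queried by collecting neighboring viewpoints and filtering their keywords) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The class and method names (`Detection`, `Omnigraph`, `query`), the hop radius `max_hops`, and the confidence threshold `min_conf` are all assumptions introduced for illustration. The learned omnigraph fusion step is not modeled: the sketch simply returns the filtered detections sorted by confidence.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Set, Tuple


@dataclass
class Detection:
    keyword: str       # keyword extracted from an instruction
    heading: float     # heading at which the object was detected
    confidence: float  # detector confidence in [0, 1]


@dataclass
class Omnigraph:
    """Hypothetical container: viewpoints tagged with keyword detections."""
    # viewpoint id -> detections recorded at that viewpoint
    detections: Dict[str, List[Detection]] = field(default_factory=dict)
    # viewpoint id -> ids of directly connected viewpoints
    edges: Dict[str, Set[str]] = field(default_factory=dict)

    def add_viewpoint(self, vp: str, dets: Iterable[Detection]) -> None:
        # Accumulate detections so memory persists across episodes of a tour.
        self.detections.setdefault(vp, []).extend(dets)
        self.edges.setdefault(vp, set())

    def connect(self, a: str, b: str) -> None:
        # Undirected edge between two visited viewpoints.
        self.edges.setdefault(a, set()).add(b)
        self.edges.setdefault(b, set()).add(a)

    def neighbors(self, start: str, max_hops: int) -> Dict[str, int]:
        # Breadth-first search: hop distance of every viewpoint within max_hops.
        hops = {start: 0}
        queue = deque([start])
        while queue:
            vp = queue.popleft()
            if hops[vp] == max_hops:
                continue
            for nxt in self.edges.get(vp, ()):
                if nxt not in hops:
                    hops[nxt] = hops[vp] + 1
                    queue.append(nxt)
        return hops

    def query(
        self,
        start: str,
        keywords: Iterable[str],
        max_hops: int = 2,      # assumed neighborhood radius
        min_conf: float = 0.3,  # assumed confidence threshold
    ) -> List[Tuple[str, int, Detection]]:
        # Collect nearby viewpoints, keep detections matching the current
        # instruction's keywords, and sort by confidence (then by distance).
        wanted = set(keywords)
        hits = []
        for vp, hop in self.neighbors(start, max_hops).items():
            for det in self.detections.get(vp, []):
                if det.keyword in wanted and det.confidence >= min_conf:
                    hits.append((vp, hop, det))
        hits.sort(key=lambda h: (-h[2].confidence, h[1]))
        return hits
```

In this sketch, each returned tuple pairs a keyword with positional information (viewpoint, hop distance, heading, confidence), which is roughly the kind of input the caption says is passed on to the agent.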