Weakly-Supervised Multi-Granularity Map Learning for Vision-and-Language Navigation

Peihao Chen; Dongyu Ji; Kunyang Lin; Runhao Zeng; Thomas H. Li; Mingkui Tan; Chuang Gan

TL;DR

This work tackles vision-and-language navigation in continuous environments (VLN-CE) by introducing a multi-granularity map that fuses fine-grained object details with semantic information to support instruction grounding. A weakly-supervised auxiliary task, instruction-relevant object localization, guides the map encoder to produce discriminative representations without manual localization labels. The learned map feeds a waypoint predictor that determines the next navigation goal, yielding state-of-the-art results on VLN-CE benchmarks and demonstrating robustness without panoramas. Limitations include reliance on semantic ground-truth and 2D top-down mapping; future work focuses on 3D mapping and real-world deployment.

Abstract

We address the practical yet challenging problem of training robot agents to navigate an environment by following a path described in language instructions. These instructions often describe objects in the environment. To navigate accurately and efficiently, it is critical to build a map that accurately represents both the spatial locations and the semantic information of those objects. However, enabling a robot to build a map that represents the environment well is extremely challenging, because the environment often contains diverse objects with varied attributes. In this paper, we propose a multi-granularity map, which contains both fine-grained object details (e.g., color, texture) and semantic classes, to represent objects more comprehensively. Moreover, we propose a weakly-supervised auxiliary task that requires the agent to localize instruction-relevant objects on the map. Through this task, the agent not only learns to localize the instruction-relevant objects for navigation but is also encouraged to learn a better map representation that reveals object information. We then feed the learned map and the instruction to a waypoint predictor to determine the next navigation goal. Experimental results show that our method outperforms the state of the art by 4.0% and 4.6% in success rate in seen and unseen environments, respectively, on the VLN-CE dataset. Code is available at https://github.com/PeihaoChen/WS-MGMap.
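
To make the idea of a map holding both fine-grained details and semantic classes concrete, the following is a minimal sketch, not the authors' implementation. It assumes that each observed point carries a 2D ground position in meters, a small fine-grained feature vector, and a semantic class id, and that the two granularities are combined by simple per-cell channel concatenation. The function names (`project_to_grid`, `build_multi_granularity_map`), the averaging rule, and the grid parameters are all hypothetical choices made for illustration.

```python
import math
from typing import List, Sequence, Tuple

# One observed point: (x_m, z_m, fine_feature_vector, semantic_class_id).
# This layout is an assumption made for illustration only.
Observation = Tuple[float, float, Sequence[float], int]
Grid = List[List[List[float]]]


def project_to_grid(points, size: int, cell_m: float, channels: int) -> Grid:
    """Average the feature vectors of all points that fall into each top-down cell."""
    grid = [[[0.0] * channels for _ in range(size)] for _ in range(size)]
    counts = [[0] * size for _ in range(size)]
    for x, z, feat in points:
        col = math.floor(x / cell_m)
        row = math.floor(z / cell_m)
        if 0 <= row < size and 0 <= col < size:  # drop points outside the map
            for c in range(channels):
                grid[row][col][c] += feat[c]
            counts[row][col] += 1
    for row in range(size):
        for col in range(size):
            n = counts[row][col]
            if n > 0:
                grid[row][col] = [v / n for v in grid[row][col]]
    return grid


def one_hot(class_id: int, num_classes: int) -> List[float]:
    """Encode a semantic class id as a one-hot vector."""
    return [1.0 if i == class_id else 0.0 for i in range(num_classes)]


def build_multi_granularity_map(
    observations: Sequence[Observation],
    size: int,
    cell_m: float,
    feat_dim: int,
    num_classes: int,
) -> Grid:
    """Concatenate fine-grained and semantic channels for every map cell."""
    fine = project_to_grid(
        [(x, z, f) for x, z, f, _ in observations], size, cell_m, feat_dim
    )
    semantic = project_to_grid(
        [(x, z, one_hot(k, num_classes)) for x, z, _, k in observations],
        size, cell_m, num_classes,
    )
    return [
        [fine[r][c] + semantic[r][c] for c in range(size)]
        for r in range(size)
    ]
```

In this sketch each cell ends up with `feat_dim + num_classes` channels, so a downstream map encoder could in principle attend to attributes such as color or texture alongside class evidence. The paper's actual map construction and encoder are described in its method section and may differ substantially from this concatenation scheme.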

Paper Structure (27 sections, 7 equations, 8 figures, 8 tables)

Introduction
Related Works
Vision-and-language navigation.
Map representation for navigation.
Weakly-supervised learning for object localization.
Vision-and-Language Navigation using Multi-Granularity Map
Problem formulation
Perceiving environment via multi-granularity map
Weakly-supervised map representation learning via object localization task
Waypoint navigator and overall learning objective
Experiments
Experimental setups
Comparisons with state-of-the-art methods
Ablation studies
Visualization results
...and 12 more sections

Figures (8)

Figure 1: Existing semantic map (a) can only represent a part of environment object classes without attribute details. Our multi-granularity map (b) contains extra fine-grained environment details (e.g., texture, color) and learns to represent diverse objects with detailed attributes through a weakly-supervised object localization task.
Figure 2: General scheme of WS-MGMap for VLN task. We assemble both fine-grained details and semantic information about environments to build a multi-granularity map. Agents learn to leverage such information for representing diverse objects through a weakly-supervised object localization task. The learned map and instruction are then fed to a waypoint navigator for deciding actions.
Figure 3: Visualization of instruction-relevant object localization results.
Figure A: Architecture of semantic hallucination module and map encoder.
Figure B: Architecture of object localization module.
...and 3 more figures
