BEVBert: Multimodal Map Pre-training for Language-guided Navigation

Dong An, Yuankai Qi, Yangguang Li, Yan Huang, Liang Wang, Tieniu Tan, Jing Shao

TL;DR

BEVBert introduces a map-based pre-training paradigm for vision-and-language navigation (VLN). It builds a hybrid topo-metric map that couples a global topological graph with a local metric grid. A cross-modal transformer fuses the map representations with the instruction and is trained with three proxy tasks, masked language modeling (MLM), hybrid single action prediction (HSAP), and masked semantic imagination (MSI), to learn spatially aware multimodal representations. Maps are constructed offline for pre-training, followed by online fine-tuning; the global graph supports long-term planning and the local grid supports short-term reasoning. Empirically, BEVBert achieves state-of-the-art results on four VLN benchmarks (R2R, R2R-CE, RxR, REVERIE), supporting the value of explicit spatial representations for language-guided navigation.

Abstract

Large-scale pre-training has shown promising results on the vision-and-language navigation (VLN) task. However, most existing pre-training methods employ discrete panoramas to learn visual-textual associations. This requires the model to implicitly correlate incomplete, duplicate observations within the panoramas, which may impair an agent's spatial understanding. Thus, we propose a new map-based pre-training paradigm that is spatial-aware for use in VLN. Concretely, we build a local metric map to explicitly aggregate incomplete observations and remove duplicates, while modeling navigation dependency in a global topological map. This hybrid design can balance the demand of VLN for both short-term reasoning and long-term planning. Then, based on the hybrid map, we devise a pre-training framework to learn a multimodal map representation, which enhances spatial-aware cross-modal reasoning thereby facilitating the language-guided navigation goal. Extensive experiments demonstrate the effectiveness of the map-based pre-training route for VLN, and the proposed method achieves state-of-the-art on four VLN benchmarks.
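
The idea of aggregating incomplete observations and removing duplicates in a local metric map can be illustrated with a small sketch. This is not the authors' implementation: the function name `build_local_grid`, the cell size, the window size, and the choice of mean pooling are all assumptions made for illustration. Each observation is a feature vector at a position relative to the agent, and observations from different views that land in the same grid cell are merged into one entry.

```python
import math

def build_local_grid(points, cell_size=0.5, half_extent=2):
    """Mean-pool features that fall into the same agent-centred grid cell.

    points: iterable of (x, y, feature), with x and y in metres relative to
            the agent and feature a list of floats (hypothetical input format).
    Returns {(row, col): mean feature} for cells inside the local window.
    """
    sums, counts = {}, {}
    for x, y, feat in points:
        col = math.floor(x / cell_size)
        row = math.floor(y / cell_size)
        if abs(row) > half_extent or abs(col) > half_extent:
            continue  # outside the local map window
        key = (row, col)
        if key not in sums:
            sums[key] = [0.0] * len(feat)
            counts[key] = 0
        sums[key] = [s + f for s, f in zip(sums[key], feat)]
        counts[key] += 1
    return {key: [s / counts[key] for s in total] for key, total in sums.items()}

# Toy observations: the first two come from different views of the same spot.
views = [
    (0.1, 0.1, [1.0, 0.0]),
    (0.2, 0.3, [3.0, 2.0]),
    (-0.3, 0.7, [4.0, 4.0]),
    (5.0, 0.0, [9.0, 9.0]),  # too far away, dropped from the local map
]
grid = build_local_grid(views)
print(grid)
```

In this toy setting the two overlapping observations collapse into a single cell, which is the sense in which a unified map removes duplicates that a set of discrete panoramas would keep separate.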

Paper Structure (28 sections, 9 equations, 9 figures, 14 tables)

Figures (9)

  • Figure 1: (a) Incomplete observations within a single view and duplicates across views may confuse the agent. (b) Projecting discrete panoramas into a unified map can solve the problem, thus facilitating spatial reasoning.
  • Figure 2: The main architecture of the proposed hybrid-map-based pre-training framework.
  • Figure 3: Online topo map update at step $t$. The agent executes an action to reach a ghost node and receives new observations. It then adds newly observed nodes to $\mathbf{G}_{t-1}$, updating node representations and types. The simulator provides navigable nodes at each step.
  • Figure 4: Comparison of navigation performance on spatial- and numerical-related instructions. BEVBert vs. DUET [chen2022think]: SR (light color) and SPL (dark color) on the R2R val unseen split. BEVBert vs. EnvEdit [li2022envedit]: SR and SDTW on the RxR val unseen split.
  • Figure 5: Predicted paths of DUET [chen2022think] and BEVBert on R2R-unseen. Yellow and green circles denote the start and target locations, respectively, and the red circles represent incorrect endpoints.
  • ...and 4 more figures
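
The online topological map update described in the Figure 3 caption can be sketched as follows. This is a minimal illustration, not the paper's code: the dictionary layout, the names `update_topo_map` and `node_repr`, and the choice to represent a ghost node by the mean of its partial views are assumptions. The `neighbours` argument stands in for the navigable nodes that the simulator provides at each step.

```python
def update_topo_map(graph, reached, pano_feat, neighbours):
    """Update the map after the agent arrives at node `reached`.

    graph: {"type": {node: "visited" | "ghost"},
            "views": {node: [feature, ...]},
            "edges": set of frozenset({a, b})}   (hypothetical layout)
    pano_feat: feature of the full observation at `reached`
    neighbours: {node: partial feature} for nodes navigable from `reached`
    """
    graph["type"][reached] = "visited"        # a ghost node becomes visited
    graph["views"][reached] = [pano_feat]     # replace partial views with the full one
    for node, feat in neighbours.items():
        if node not in graph["type"]:
            graph["type"][node] = "ghost"     # newly observed, not yet explored
            graph["views"][node] = []
        if graph["type"][node] == "ghost":
            graph["views"][node].append(feat)  # accumulate partial observations
        graph["edges"].add(frozenset((reached, node)))
    return graph

def node_repr(graph, node):
    """Mean of all feature vectors stored for a node."""
    views = graph["views"][node]
    return [sum(column) / len(views) for column in zip(*views)]

graph = {"type": {}, "views": {}, "edges": set()}
update_topo_map(graph, "A", [1.0, 1.0], {"B": [2.0, 0.0], "C": [4.0, 2.0]})
update_topo_map(graph, "B", [5.0, 5.0], {"A": [1.0, 1.0], "C": [6.0, 4.0]})
print(graph["type"], node_repr(graph, "C"))
```

Here node C is seen from both A and B but never visited, so it stays a ghost node and its representation pools the two partial views.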