BEVBert: Multimodal Map Pre-training for Language-guided Navigation
Dong An, Yuankai Qi, Yangguang Li, Yan Huang, Liang Wang, Tieniu Tan, Jing Shao
TL;DR
BEVBert introduces a map-based pre-training paradigm for vision-and-language navigation (VLN), built on a hybrid topo-metric map that couples a global topological graph with a local metric grid. A cross-modal transformer fuses the map representations with instructions and is trained with three proxy tasks (MLM, HSAP, and MSI) to learn spatially aware multimodal representations. Offline map construction precedes online fine-tuning, enabling efficient long-term planning and precise short-term reasoning. Empirically, BEVBert achieves state-of-the-art results on four VLN benchmarks (R2R, R2R-CE, RxR, REVERIE), supporting the value of explicit spatial representations for language-guided navigation.
Abstract
Large-scale pre-training has shown promising results on the vision-and-language navigation (VLN) task. However, most existing pre-training methods employ discrete panoramas to learn visual-textual associations. This requires the model to implicitly correlate incomplete, duplicate observations within the panoramas, which may impair an agent's spatial understanding. We therefore propose a new map-based pre-training paradigm for VLN that is spatially aware. Concretely, we build a local metric map to explicitly aggregate incomplete observations and remove duplicates, while modeling navigation dependency in a global topological map. This hybrid design balances the demands of VLN for both short-term reasoning and long-term planning. Based on the hybrid map, we then devise a pre-training framework to learn a multimodal map representation, which enhances spatially aware cross-modal reasoning and thereby facilitates the language-guided navigation goal. Extensive experiments demonstrate the effectiveness of the map-based pre-training route for VLN, and the proposed method achieves state-of-the-art results on four VLN benchmarks.
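As a rough illustration of the hybrid map idea described in the abstract, the sketch below keeps a global topological graph of visited locations alongside a local metric grid centred on the agent. It is not the authors' implementation: the class name `HybridMap`, its methods, the scalar "feature" per observation, and the cell and grid sizes are all assumptions chosen for brevity. Observations that fall into the same grid cell are averaged, which is one simple way to aggregate partial views and collapse duplicates.

```python
import math
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class HybridMap:
    """Toy hybrid topo-metric map. All names and defaults are hypothetical."""
    cell_size: float = 0.5  # metres per grid cell (assumed value)
    grid_size: int = 5      # local grid is grid_size x grid_size cells (assumed)
    edges: dict = field(default_factory=lambda: defaultdict(set))
    observations: list = field(default_factory=list)  # (x, y, feature), world frame

    def add_edge(self, a, b):
        """Record that nodes a and b are mutually navigable (global topological map)."""
        self.edges[a].add(b)
        self.edges[b].add(a)

    def observe(self, x, y, feature):
        """Store one observed feature at world position (x, y)."""
        self.observations.append((x, y, feature))

    def local_grid(self, agent_x, agent_y):
        """Build the local metric map centred on the agent.

        Features landing in the same cell are averaged, so repeated or
        partial views of one location collapse into a single entry.
        """
        half = self.grid_size // 2
        sums = defaultdict(float)
        counts = defaultdict(int)
        for x, y, feat in self.observations:
            # Round to the nearest cell, then shift so the agent sits at the centre.
            col = math.floor((x - agent_x) / self.cell_size + 0.5) + half
            row = math.floor((y - agent_y) / self.cell_size + 0.5) + half
            if 0 <= row < self.grid_size and 0 <= col < self.grid_size:
                sums[(row, col)] += feat
                counts[(row, col)] += 1
        return {cell: sums[cell] / counts[cell] for cell in sums}


# Usage: two overlapping views of the same spot merge into one cell,
# and an observation far from the agent is dropped from the local grid.
hybrid = HybridMap()
hybrid.add_edge("node_a", "node_b")
hybrid.observe(0.0, 0.0, 1.0)
hybrid.observe(0.1, 0.1, 3.0)    # duplicate view of roughly the same location
hybrid.observe(1.0, 0.0, 5.0)
hybrid.observe(10.0, 10.0, 7.0)  # outside the local window
grid = hybrid.local_grid(0.0, 0.0)
```

In this toy setup the graph would serve long-term planning over visited nodes, while the grid supplies the fine-grained local layout for short-term reasoning. A real system would store learned visual feature vectors rather than scalars.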
