Correctable Landmark Discovery via Large Models for Vision-Language Navigation

Bingqian Lin, Yunshuang Nie, Ziming Wei, Yi Zhu, Hang Xu, Shikui Ma, Jianzhuang Liu, Xiaodan Liang

TL;DR

CONSOLE reframes Vision-Language Navigation (VLN) as open-world sequential landmark discovery: ChatGPT generates landmark cooccurrence priors, and CLIP performs landmark discovery against the agent's observations. A learnable cooccurrence scoring module and a consistency-based objective suppress noisy priors, enabling accurate landmark-guided action decisions. An observation enhancement step fuses the corrected landmark features with the observation features, so the framework can be combined with different VLN agents. Across R2R, REVERIE, R4R, and RxR, CONSOLE improves over strong baselines and sets new state-of-the-art results in unseen scenarios on R2R and R4R, illustrating the value of integrating open-world knowledge into embodied navigation.

Abstract

Vision-Language Navigation (VLN) requires an agent to follow language instructions to reach a target position. A key factor for successful navigation is aligning the landmarks implied in the instruction with diverse visual observations. However, previous VLN agents fail to perform accurate modality alignment, especially in unexplored scenes, since they learn from limited navigation data and lack sufficient open-world alignment knowledge. In this work, we propose a new VLN paradigm, called COrrectable LaNdmark DiScOvery via Large ModEls (CONSOLE). In CONSOLE, we cast VLN as an open-world sequential landmark discovery problem by introducing a novel correctable landmark discovery scheme based on two large models, ChatGPT and CLIP. Specifically, we use ChatGPT to provide rich open-world landmark cooccurrence commonsense, and conduct CLIP-driven landmark discovery based on these commonsense priors. To mitigate the noise in the priors caused by the lack of visual constraints, we introduce a learnable cooccurrence scoring module, which corrects the importance of each cooccurrence according to actual observations for accurate landmark discovery. We further design an observation enhancement strategy that allows our framework to be combined with different VLN agents: the corrected landmark features are used to obtain enhanced observation features for action decision. Extensive experimental results on multiple popular VLN benchmarks (R2R, REVERIE, R4R, RxR) show the significant superiority of CONSOLE over strong baselines. In particular, CONSOLE establishes new state-of-the-art results on R2R and R4R in unseen scenarios. Code is available at https://github.com/expectorlin/CONSOLE.
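
To make the data flow described in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes that cooccurrence texts and candidate observations have already been embedded into a shared space (random arrays stand in for CLIP text and image features). The function name `discover_and_enhance`, the bilinear scoring matrix `score_weights`, the mean-pooled observation context, the `temperature` value, and the additive fusion are all hypothetical choices made for illustration.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length along the given axis."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def discover_and_enhance(obs_feats, cooc_feats, score_weights, temperature=0.1):
    """Illustrative correctable landmark discovery step (all names hypothetical).

    obs_feats:     (K, D) features of K candidate observations
                   (stand-in for CLIP image features).
    cooc_feats:    (M, D) features of M cooccurrence texts for the current
                   landmark (stand-in for CLIP text features).
    score_weights: (D, D) matrix playing the role of a learnable scorer.
    """
    obs = l2_normalize(obs_feats)
    cooc = l2_normalize(cooc_feats)

    # Summarize what the agent currently sees (assumption: mean pooling).
    context = obs.mean(axis=0)                              # (D,)

    # Re-weight each cooccurrence prior according to the observations.
    cooc_scores = softmax(cooc @ score_weights @ context)   # (M,)

    # Corrected landmark feature: score-weighted mix of cooccurrence features.
    landmark = l2_normalize(cooc_scores @ cooc)             # (D,)

    # Landmark discovery: how well each candidate matches the landmark.
    match = softmax(obs @ landmark / temperature)           # (K,)

    # Observation enhancement (assumption: additive fusion gated by the match).
    enhanced = obs_feats + match[:, None] * landmark[None, :]
    return cooc_scores, match, enhanced


# Usage with random stand-in features: 4 candidates, 3 cooccurrences, D = 8.
rng = np.random.default_rng(0)
scores, match, enhanced = discover_and_enhance(
    rng.normal(size=(4, 8)), rng.normal(size=(3, 8)), np.eye(8)
)
print(scores.shape, match.shape, enhanced.shape)
```

In this sketch the enhanced features keep the shape of the original observation features, which is what would let a downstream VLN agent consume them without architectural changes. In a trained system `score_weights` would be learned rather than fixed to the identity.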

Paper Structure (21 sections, 18 equations, 8 figures, 8 tables)

Figures (8)

  • Figure 1: Correctable landmark discovery of CONSOLE. We provide customized prompts for ChatGPT to generate landmark cooccurrence priors. We then introduce a learnable cooccurrence scoring module to conduct CLIP-driven correctable landmark discovery based on the priors. A landmark or cooccurrence shown in a larger font has a higher score.
  • Figure 2: Overview of CONSOLE. Before navigation, the landmark cooccurrence priors $U$ are obtained through the landmark cooccurrence prior generation module. At navigation timestep $t$, the agent conducts correctable landmark discovery through landmark shifting, landmark discovery, and learnable cooccurrence scoring. An observation enhancing module enhances the observation features $\mathbf{f}_{O_{t}}$ using the corrected landmark features $\mathbf{f}_{U_{t}}$, and the enhanced observations $\mathbf{f'}_{O_{t}}$ are used for action decision. Besides the navigation loss $\mathcal{L}_{\mathrm{nav}}$, we introduce the consistency loss $\mathcal{L}_{\mathrm{cs}}$ and the contrastive loss $\mathcal{L}_{\mathrm{ct}}$ for optimization.
  • Figure 3: Numbers of Landmarks and Cooccurrences.
  • Figure 4: Visualization examples of action decision. The ground-truth (GT) action and the corrected landmark prediction are denoted in the green boxes.
  • Figure 5: Visualization example of landmark shifting. The ground-truth action (GT) and the mentioned landmark are annotated in the green boxes and orange boxes, respectively.
  • ...and 3 more figures
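
The caption of Figure 2 names three training losses, but this excerpt does not state how they are combined. One common choice, given here purely as an assumption, is a weighted sum. The weights $\lambda_{1}$ and $\lambda_{2}$ are hypothetical symbols that do not appear in the text above:

$$\mathcal{L} = \mathcal{L}_{\mathrm{nav}} + \lambda_{1}\,\mathcal{L}_{\mathrm{cs}} + \lambda_{2}\,\mathcal{L}_{\mathrm{ct}}$$

Under this reading, setting $\lambda_{1} = \lambda_{2} = 0$ would recover training with the navigation loss alone.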