$A^2$Nav: Action-Aware Zero-Shot Robot Navigation by Exploiting Vision-and-Language Ability of Foundation Models

Peihao Chen, Xinyu Sun, Hongyan Zhi, Runhao Zeng, Thomas H. Li, Gaowen Liu, Mingkui Tan, Chuang Gan

TL;DR

This work tackles zero-shot vision-and-language navigation (VLN) by explicitly modeling the action demands embedded in language instructions. It introduces $A^2$Nav, a two-component system consisting of an instruction parser based on a large language model and an action-aware navigation policy with five specialized navigators, which are trained on image-goal data in a manner inspired by ZSON rather than on path-instruction annotations. By decomposing instructions into action-specific sub-tasks (e.g., GoTo, GoPast, GoInto, GoThrough, Exit) and learning a corresponding navigator for each, $A^2$Nav achieves competitive zero-shot VLN performance, even surpassing some supervised methods on RxR-Habitat and generalizing well as measured by CSR. The approach demonstrates the value of aligning navigation behavior with instruction semantics, enabling more accurate and explainable execution in unseen environments.

Abstract

We study the task of zero-shot vision-and-language navigation (ZS-VLN), a practical yet challenging problem in which an agent learns to navigate following a path described by language instructions without requiring any path-instruction annotation data. Normally, the instructions have complex grammatical structures and often contain various action descriptions (e.g., "proceed beyond", "depart from"). How to correctly understand and execute these action demands is a critical problem, and the absence of annotated data makes it even more challenging. Note that a well-educated human being can easily understand path instructions without the need for any special training. In this paper, we propose an action-aware zero-shot VLN method ($A^2$Nav) by exploiting the vision-and-language ability of foundation models. Specifically, the proposed method consists of an instruction parser and an action-aware navigation policy. The instruction parser utilizes the advanced reasoning ability of large language models (e.g., GPT-3) to decompose complex navigation instructions into a sequence of action-specific object navigation sub-tasks. Each sub-task requires the agent to localize the object and navigate to a specific goal position according to the associated action demand. To accomplish these sub-tasks, an action-aware navigation policy is learned from freely collected action-specific datasets that reveal distinct characteristics of each action demand. We use the learned navigation policy for executing sub-tasks sequentially to follow the navigation instruction. Extensive experiments show $A^2$Nav achieves promising ZS-VLN performance and even surpasses the supervised learning methods on R2R-Habitat and RxR-Habitat datasets.
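The parse-then-execute structure described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the textual sub-task format (`Action(landmark)` separated by semicolons), the names `SubTask`, `parse_subtasks`, `NAVIGATORS`, and `execute`, and the stub navigators are all assumptions made for illustration. The five action names are taken from the TL;DR. In the paper the parser is a large language model and each navigator is a learned policy; here the parser output is assumed to be already available as a string.

```python
import re
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, List


class Action(Enum):
    """The five action demands named in the TL;DR."""
    GO_TO = "GoTo"
    GO_PAST = "GoPast"
    GO_INTO = "GoInto"
    GO_THROUGH = "GoThrough"
    EXIT = "Exit"


@dataclass(frozen=True)
class SubTask:
    """One action-specific object navigation sub-task."""
    action: Action
    landmark: str


# Hypothetical output format for the instruction parser,
# e.g. "Exit(bedroom); GoPast(sofa)". The real LLM prompt/format is not given here.
_SUBTASK_PATTERN = re.compile(r"\s*(\w+)\(([^)]*)\)\s*")


def parse_subtasks(llm_output: str) -> List[SubTask]:
    """Turn the (assumed) parser output string into an ordered sub-task list."""
    subtasks: List[SubTask] = []
    for chunk in llm_output.split(";"):
        if not chunk.strip():
            continue  # tolerate a trailing separator
        match = _SUBTASK_PATTERN.fullmatch(chunk)
        if match is None:
            raise ValueError(f"Unrecognized sub-task: {chunk!r}")
        name, landmark = match.groups()
        # Action(name) raises ValueError for an unknown action name.
        subtasks.append(SubTask(Action(name), landmark.strip()))
    return subtasks


def _make_stub_navigator(action: Action) -> Callable[[str], str]:
    """Stand-in for a learned action-specific navigator (assumption)."""
    def navigate(landmark: str) -> str:
        return f"{action.value} navigator -> {landmark}"
    return navigate


# One navigator per action demand, selected by the sub-task's action.
NAVIGATORS: Dict[Action, Callable[[str], str]] = {
    action: _make_stub_navigator(action) for action in Action
}


def execute(subtasks: List[SubTask]) -> List[str]:
    """Run the sub-tasks sequentially, dispatching each to its navigator."""
    return [NAVIGATORS[task.action](task.landmark) for task in subtasks]


if __name__ == "__main__":
    plan = parse_subtasks("Exit(bedroom); GoPast(sofa); GoTo(kitchen table)")
    for step in execute(plan):
        print(step)
```

Keying the navigators by action type mirrors the idea that the same landmark calls for a different goal position depending on the action demand, so the choice of navigator is made per sub-task rather than once per instruction.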

Paper Structure

This paper contains 33 sections, 1 equation, 13 figures, 4 tables.

Figures (13)

  • Figure 1: Existing zero-shot VLN methods navigate to the front of the landmarks sequentially, overlooking the action demands in the instruction. Our $A^2$Nav correctly parses the action demands from the instruction and executes them accurately to follow the navigation instruction successfully.
  • Figure 2: General scheme of $A^2$Nav for the zero-shot VLN task. $A^2$Nav consists of an instruction parser that decomposes an instruction into a sequence of action-specific object navigation sub-tasks, and an action-aware navigation policy that executes these sub-tasks sequentially.
  • Figure 3: Visualization of different sub-task types. For different action demands, the landmark is located at a different position relative to the path.
  • Figure 4: Training and inference pipeline of action-specific object navigators.
  • Figure 5: Comparison with the supervised learning methods that are trained on partial training data.
  • ...and 8 more figures