CorNav: Autonomous Agent with Self-Corrected Planning for Zero-Shot Vision-and-Language Navigation

Xiwen Liang, Liang Ma, Shanshan Guo, Jianhua Han, Hang Xu, Shikui Ma, Xiaodan Liang

TL;DR

CorNav is a zero-shot framework that uses a large language model for decision-making and has two key components: 1) environmental feedback for refining future plans and adjusting actions, and 2) multiple domain experts for parsing instructions, understanding scenes, and refining predicted actions.

Abstract

Understanding and following natural language instructions while navigating through complex, real-world environments poses a significant challenge for general-purpose robots. These environments often include obstacles and pedestrians, making it essential for autonomous agents to possess the capability of self-corrected planning to adjust their actions based on feedback from the surroundings. However, the majority of existing vision-and-language navigation (VLN) methods primarily operate in less realistic simulator settings and do not incorporate environmental feedback into their decision-making processes. To address this gap, we introduce a novel zero-shot framework called CorNav, utilizing a large language model for decision-making and comprising two key components: 1) incorporating environmental feedback for refining future plans and adjusting its actions, and 2) multiple domain experts for parsing instructions, scene understanding, and refining predicted actions. In addition to the framework, we develop a 3D simulator that renders realistic scenarios using Unreal Engine 5. To evaluate the effectiveness and generalization of navigation agents in a zero-shot multi-task setting, we create a benchmark called NavBench. Extensive experiments demonstrate that CorNav consistently outperforms all baselines by a significant margin across all tasks. On average, CorNav achieves a success rate of 28.1%, surpassing the best baseline's performance of 20.5%.

Paper Structure (40 sections, 3 equations, 13 figures, 8 tables)


Figures (13)

  • Figure 1: Comparison with existing VLN agents. (a) The single-agent planning paradigm requires the agent to analyse and make decisions by itself. (b) The multi-agent planning paradigm enables the agent to communicate with multiple experts and perform complex reasoning. (c) Our self-corrected planning considers in-plan or out-of-plan feedback from a near-realistic environment.
  • Figure 2: The overall architecture of CorNav. After receiving the instruction, the instruction parsing expert extracts landmarks or figures out the needed objects. Then the agent generates the initial plan based on the instruction and information from the instruction parsing expert. The vision perception expert is driven by an image tagging model and an open-vocabulary grounding model, and performs scene understanding given four perspectives. Environmental feedback records both in-plan and out-of-plan feedback, while trajectory history maintains the reasoning process and executed actions. The decision-making expert assists the agent in deciding the final action. Finally, the local policy plans a path for the robot.
  • Figure 3: The illustration of the acting module in CorNav.
  • Figure 4: Our simulator includes scenes of different difficulty, i.e., restaurant, cafe, nursing room, and home.
  • Figure 5: Supported agents in our simulator. We include agents in a variety of application scenarios, such as humanoid agents, sweeping agents, and delivery agents.
  • ...and 8 more figures
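
The control flow described in the Figure 2 caption can be sketched roughly as follows. This is a toy illustration, not the authors' implementation: the names `parse_instruction`, `make_toy_env`, `navigate`, and `Trace` are hypothetical, the language model and perception models are replaced by simple stubs, and "out-of-plan" feedback is assumed here to mean an unexpected event such as a blocked path, since this page does not define the term precisely.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Trace:
    history: List[str] = field(default_factory=list)   # executed actions
    feedback: List[str] = field(default_factory=list)  # environmental feedback log


def parse_instruction(instruction: str, known_landmarks: List[str]) -> List[str]:
    """Toy stand-in for the instruction parsing expert.

    Keeps the known landmarks that appear in the instruction,
    ordered by where they are first mentioned.
    """
    text = instruction.lower()
    found = [lm for lm in known_landmarks if lm in text]
    return sorted(found, key=text.index)


def make_toy_env(blocked_once: List[str]) -> Callable[[str], Dict[str, bool]]:
    """Toy stand-in for perception: each listed landmark is blocked exactly once."""
    pending = set(blocked_once)

    def observe(target: str) -> Dict[str, bool]:
        if target in pending:
            pending.discard(target)
            return {"blocked": True}
        return {"blocked": False}

    return observe


def navigate(
    plan: List[str],
    observe: Callable[[str], Dict[str, bool]],
    max_steps: int = 20,
) -> Trace:
    """Execute the plan, adjusting the next action when feedback is out-of-plan."""
    trace = Trace()
    remaining = list(plan)
    steps = 0
    while remaining and steps < max_steps:
        steps += 1
        target = remaining[0]
        if observe(target)["blocked"]:
            # Assumed out-of-plan case: log the feedback, take a corrective
            # action, and keep the target in the plan for a retry.
            trace.feedback.append(f"out-of-plan: path to {target} is blocked")
            trace.history.append(f"detour toward {target}")
            continue
        # Assumed in-plan case: the step went as expected, so advance the plan.
        trace.feedback.append(f"in-plan: reached {target}")
        trace.history.append(f"move to {target}")
        remaining.pop(0)
    return trace
```

Calling `navigate(parse_instruction("Go past the sofa and stop at the table", ["table", "sofa"]), make_toy_env(["table"]))` runs one episode in which the path to the table is blocked once. The returned `Trace` holds both the executed actions and the feedback log, loosely mirroring the trajectory history and environmental feedback named in the caption.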