Discuss Before Moving: Visual Language Navigation via Multi-expert Discussions

Yuxing Long, Xiaoqi Li, Wenzhe Cai, Hao Dong

TL;DR

This paper tackles the challenge of visual language navigation by moving beyond single-round self-thinking models to a zero-shot framework where multiple domain experts, prompted within large language models, discuss and verify information before each move. The DiscussNav agent orchestrates instruction analysis, vision perception, completion estimation, and decision testing through prompted experts, achieving strong performance on R2R and real-robot tasks. Key contributions include the construction of dedicated domain experts, a multi-round discussion workflow, and demonstrated improvements over zero-shot and some trained baselines, along with ablations validating each expert’s role. The approach offers a scalable path to robust embodied navigation by leveraging the collective reasoning of diverse LLM-driven experts.

Abstract

Visual language navigation (VLN) is an embodied task demanding a wide range of skills encompassing understanding, perception, and planning. For such a multifaceted challenge, previous VLN methods rely entirely on one model's own thinking to make predictions within one round. However, existing models, even the most advanced large language model GPT4, still struggle to handle multiple tasks through single-round self-thinking. In this work, drawing inspiration from expert consultation meetings, we introduce a novel zero-shot VLN framework. Within this framework, large models possessing distinct abilities serve as domain experts. Our proposed navigation agent, namely DiscussNav, can actively discuss with these experts to collect essential information before moving at every step. These discussions cover critical navigation subtasks like instruction understanding, environment perception, and completion estimation. Through comprehensive experiments, we demonstrate that discussions with domain experts can effectively facilitate navigation by perceiving instruction-relevant information, correcting inadvertent errors, and sifting out inconsistent movement decisions. The performance on the representative VLN task R2R shows that our method surpasses the leading zero-shot VLN model by a large margin on all metrics. Additionally, real-robot experiments demonstrate clear advantages of our method over single-round self-thinking.

Paper Structure

This paper contains 14 sections, 4 figures, and 3 tables.

Figures (4)

  • Figure 1: Comparison between the single-round self-thinking and multi-expert discussion inference paradigms. Single-round self-thinking models passively take all accessible navigation information as input and have to make a prediction in one execution, while our DiscussNav agent is relieved of completing complex reasoning in a single pass and can actively obtain the information it needs via multi-expert discussions.
  • Figure 2: Demonstration of the navigation agent DiscussNav powered by a large language model (GPT4). The DiscussNav agent fills in question templates in the corpus to discuss with multiple domain experts and prompts the large language model to make navigation decisions based on the multi-expert discussion results. The "[xxx]" are slots that should be filled with specified information. We use color (see Fig. \ref{fig:method}) to distinguish different expert discussion results. In the Matterport3D simulator (MatterSim) test, the DiscussNav agent moves to the first candidate viewpoint in the predicted direction.
  • Figure 3: Establishment of domain experts and discussions between the DiscussNav agent and the experts. After receiving the VLN instruction, DiscussNav first discusses with instruction analysis experts to learn about actions and landmarks. Then, at every movement step, DiscussNav communicates with vision perception experts about landmark-relevant visual information in the surrounding directions and interacts with completion estimation experts about executed actions. Based on the discussion results, DiscussNav makes $N$ different predictions and invites decision testing experts to decide the final movement direction.
  • Figure 4: Qualitative results of DiscussNav's performance on the real robot. Through discussions with domain experts, DiscussNav can observe open-vocabulary landmarks and navigate to fine-grained landmarks. During the discussion, erroneous information can be corrected promptly by other experts or by the DiscussNav agent, which reduces error accumulation. In addition, inconsistent movement decisions can be filtered out through discussions with decision testing experts.
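
The per-step workflow described in the Figure 3 caption can be illustrated with a minimal Python sketch. This is not the authors' implementation: every name here (`Experts`, `discuss_step`, the `stub_*` functions) is hypothetical, the experts are plain stub functions rather than prompted large models, and a majority vote merely stands in for the discussion with decision testing experts, whose actual procedure is not specified in this summary.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Experts:
    """Hypothetical bundle of expert callables; each would wrap a prompted large model."""
    analyze_instruction: Callable[[str], Dict[str, List[str]]]
    perceive: Callable[[str, List[str]], str]
    estimate_completion: Callable[[List[str], List[str]], List[str]]
    test_decisions: Callable[[List[str]], str]


def discuss_step(experts, plan, directions, history, predict, n=3):
    """One movement step: gather expert input, make n predictions, pick one."""
    # Vision perception: landmark-relevant description for each surrounding direction.
    observations = {d: experts.perceive(d, plan["landmarks"]) for d in directions}
    # Completion estimation: which instructed actions have already been executed.
    completed = experts.estimate_completion(plan["actions"], history)
    # The agent makes n (possibly different) predictions from the discussion results.
    candidates = [predict(observations, completed, i) for i in range(n)]
    # Decision testing: experts settle on the final movement direction.
    return experts.test_decisions(candidates)


# --- Stub experts (assumptions for illustration only) ---
def stub_analyze(instruction):
    return {"actions": ["walk past the sofa", "stop at the door"],
            "landmarks": ["sofa", "door"]}


def stub_perceive(direction, landmarks):
    scene = {"front": "a sofa", "left": "a blank wall", "right": "a door"}
    return scene[direction]


def stub_completion(actions, history):
    # Pretend one instructed action is completed per move already taken.
    return actions[:len(history)]


def stub_test(candidates):
    # Majority vote stands in for the decision testing discussion.
    return Counter(candidates).most_common(1)[0][0]


def stub_predict(observations, completed, i):
    # Deliberately inconsistent: the third prediction disagrees with the others.
    return "left" if i == 2 else "front"


experts = Experts(stub_analyze, stub_perceive, stub_completion, stub_test)
plan = experts.analyze_instruction("Walk past the sofa and stop at the door.")
move = discuss_step(experts, plan, ["front", "left", "right"], [], stub_predict)
print(move)  # prints "front"
```

With the stubs above, two of the three predictions agree on "front", so the stand-in decision test filters out the inconsistent "left" prediction. In a real system each stub would be replaced by a call to a prompted model, and the instruction analysis would run once per instruction rather than once per step.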