NavGPT: Explicit Reasoning in Vision-and-Language Navigation with Large Language Models

Gengze Zhou, Yicong Hong, Qi Wu

TL;DR

The paper tackles vision-and-language navigation (VLN) by introducing NavGPT, a purely LLM-driven agent that performs zero-shot VLN by translating multi-modal visual observations into natural language prompts and exposing its reasoning explicitly, backed by a navigation-history buffer. It combines a visual perception module (BLIP-2, object detectors, depth) with a prompt manager that feeds an LLM producing ReAct-like reasoning traces; the LLM selects actions in a navigation graph while tracking progress through the history. Key findings show that GPT-4 can perform high-level planning, sub-goal decomposition, landmark identification, and trajectory visualization, but zero-shot performance lags behind supervised models because of limits in the quality of perceptual descriptions and in object tracking; ablations show the impact of observation granularity and additional semantic cues. The work highlights the potential of coupling LLM reasoning with multi-modal perception to advance embodied navigation and suggests future work on multi-modal LLMs or hybrid systems to achieve robust, general VLN agents.

Abstract

Trained with an unprecedented scale of data, large language models (LLMs) like ChatGPT and GPT-4 exhibit the emergence of significant reasoning abilities from model scaling. Such a trend underscores the potential of training LLMs with unlimited language data, advancing the development of a universal embodied agent. In this work, we introduce NavGPT, a purely LLM-based instruction-following navigation agent, to reveal the reasoning capability of GPT models in complex embodied scenes by performing zero-shot sequential action prediction for vision-and-language navigation (VLN). At each step, NavGPT takes the textual descriptions of visual observations, navigation history, and future explorable directions as inputs to reason about the agent's current status, and makes the decision to approach the target. Through comprehensive experiments, we demonstrate that NavGPT can explicitly perform high-level planning for navigation, including decomposing instructions into sub-goals, integrating commonsense knowledge relevant to navigation task resolution, identifying landmarks from observed scenes, tracking navigation progress, and adapting to exceptions with plan adjustment. Furthermore, we show that LLMs are capable of generating high-quality navigational instructions from observations and actions along a path, as well as drawing an accurate top-down metric trajectory given the agent's navigation history. Although the zero-shot performance of NavGPT on R2R tasks still falls short of trained models, we suggest adapting multi-modality inputs for LLMs to use as visual navigation agents and applying the explicit reasoning of LLMs to benefit learning-based models.
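
The per-step loop described in the abstract (text observations, navigation history, and explorable directions in; a reasoning trace and an action out) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the prompt layout, the `Thought:`/`Action:` reply format, the fallback to `STOP` on an unparseable reply, and every function name here are hypothetical, and `llm` stands for any callable that maps a prompt string to a reply string.

```python
import re
from typing import Callable, Dict, List, Tuple


def build_prompt(instruction: str, history: List[str],
                 observation: str, candidates: Dict[str, str]) -> str:
    """Assemble one step's text prompt (hypothetical layout, not the paper's)."""
    history_text = "\n".join(history) if history else "(none)"
    options = "\n".join(f"- {vid}: {desc}" for vid, desc in candidates.items())
    return (
        f"Instruction: {instruction}\n"
        f"History:\n{history_text}\n"
        f"Current observation: {observation}\n"
        f"Navigable viewpoints:\n{options}\n"
        "Reply with 'Thought: ...' then 'Action: <viewpoint id or STOP>'."
    )


def parse_reply(reply: str, candidates: Dict[str, str]) -> Tuple[str, str]:
    """Split a reply into (thought, action).

    Assumption: any missing or unknown action is treated as STOP.
    """
    thought = re.search(r"Thought:\s*(.*?)(?=\nAction:|\Z)", reply, re.S)
    action = re.search(r"Action:\s*(\S+)", reply)
    chosen = action.group(1) if action else "STOP"
    if chosen != "STOP" and chosen not in candidates:
        chosen = "STOP"
    return (thought.group(1).strip() if thought else "", chosen)


def step(llm: Callable[[str], str], instruction: str, history: List[str],
         observation: str, candidates: Dict[str, str]) -> str:
    """Run one decision step and append the result to the history in place."""
    prompt = build_prompt(instruction, history, observation, candidates)
    thought, action = parse_reply(llm(prompt), candidates)
    history.append(f"Thought: {thought} | Action: {action}")
    return action
```

In use, `step` would be called once per viewpoint until it returns `STOP`, with `candidates` rebuilt from the navigation graph at each new position. Keeping the reasoning trace in `history` is what lets later steps refer back to earlier decisions.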

Paper Structure (30 sections, 1 equation, 17 figures, 3 tables)

This paper contains 30 sections, 1 equation, 17 figures, 3 tables.

Figures (17)

  • Figure 1: The architecture of NavGPT. NavGPT synergizes reasoning and actions in LLMs to perform zero-shot Vision-and-Language Navigation following navigation system principles. It interacts with different visual foundation models to adapt multi-modality inputs, handles the length of the history with a history buffer and a GPT-3.5 summarizer, and aggregates various sources of information through a prompt manager. NavGPT parses the generated results from LLMs (LLM Thoughts and LLM Action) to move to the next viewpoint.
  • Figure 2: The process of forming natural language description from visual input. We used 8 directions to represent a viewpoint and show the process of forming the descriptions for one of the directions.
  • Figure 3: Qualitative results of NavGPT. NavGPT can explicitly perform high-level planning for sequential action prediction, including decomposing instructions into sub-goals, integrating commonsense knowledge, identifying landmarks from observed scenes, tracking navigation progress, and handling exceptions with plan adjustment.
  • Figure 4: We evaluate GPT-4 on a case where NavGPT successfully follows the ground truth path, using only the historical actions $\mathcal{A}_{<t+1}$ and observations $\mathcal{O}_{<t+1}$ to generate an instruction (without the reasoning trace $\mathcal{R}_{<t+1}$, to avoid information leakage), and using the entire navigation history $\mathcal{H}_{<t+1}$ to draw a top-down trajectory.
  • Figure 5: The prompt for GPT-3.5 summarizer and the summarized results. The original descriptions from BLIP-2 are in orange.
  • ...and 12 more figures
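
The Figure 2 caption mentions representing a viewpoint with 8 directions. One way this could look in code is sketched below, assuming each direction covers a 45-degree sector centred on multiples of 45 degrees relative to the agent's heading. The sector boundaries, the direction names, and both function names are assumptions made for illustration; the caption does not specify them.

```python
from typing import Dict

# Hypothetical direction names, clockwise from straight ahead.
DIRECTIONS = [
    "front", "front right", "right", "rear right",
    "rear", "rear left", "left", "front left",
]


def direction_label(relative_heading_deg: float) -> str:
    """Map a heading relative to the agent (degrees, clockwise) to a sector name."""
    # Shift by half a sector so that "front" spans -22.5 to +22.5 degrees.
    sector = int(((relative_heading_deg % 360) + 22.5) // 45) % 8
    return DIRECTIONS[sector]


def describe_viewpoint(captions: Dict[float, str]) -> str:
    """Join per-direction captions into one text block, ordered by heading."""
    lines = [f"{direction_label(h)}: {c}" for h, c in sorted(captions.items())]
    return "\n".join(lines)
```

For example, `describe_viewpoint({90.0: "a door", 0.0: "a sofa"})` yields one line for the front view followed by one for the right view, which is the kind of text a prompt manager could pass to the LLM as the current observation.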