NavGPT: Explicit Reasoning in Vision-and-Language Navigation with Large Language Models
Gengze Zhou, Yicong Hong, Qi Wu
TL;DR
The paper tackles vision-and-language navigation (VLN) by introducing NavGPT, a purely LLM-driven agent that performs zero-shot VLN by translating multi-modal visual observations into natural language prompts and exposing its reasoning explicitly, supported by a navigation-history buffer. It combines a visual perception module (BLIP-2, object detectors, depth) with a prompt manager that feeds a reasoning-enabled LLM (via ReAct-like traces); the LLM selects actions in a navigation graph while maintaining a history of its progress. The experiments show that GPT-4 can perform high-level planning, sub-goal decomposition, landmark identification, and trajectory visualization, but zero-shot performance lags behind supervised models because of the quality of the perceptual descriptions and limitations in object tracking; ablations show the effect of observation granularity and of additional semantic cues. The work highlights the potential of coupling LLM reasoning with multi-modal perception to advance embodied navigation, and suggests future work on multi-modal LLMs or hybrid systems to achieve robust, general VLN agents.
Abstract
Trained with an unprecedented scale of data, large language models (LLMs) like ChatGPT and GPT-4 exhibit the emergence of significant reasoning abilities from model scaling. This trend underscores the potential of training LLMs with unlimited language data, advancing the development of a universal embodied agent. In this work, we introduce NavGPT, a purely LLM-based instruction-following navigation agent, to reveal the reasoning capability of GPT models in complex embodied scenes by performing zero-shot sequential action prediction for vision-and-language navigation (VLN). At each step, NavGPT takes textual descriptions of visual observations, the navigation history, and future explorable directions as inputs, reasons about the agent's current status, and decides how to approach the target. Through comprehensive experiments, we demonstrate that NavGPT can explicitly perform high-level planning for navigation, including decomposing instructions into sub-goals, integrating commonsense knowledge relevant to navigation task resolution, identifying landmarks from observed scenes, tracking navigation progress, and adapting to exceptions with plan adjustment. Furthermore, we show that LLMs are capable of generating high-quality navigational instructions from observations and actions along a path, as well as drawing an accurate top-down metric trajectory given the agent's navigation history. Although the zero-shot performance of NavGPT on R2R tasks still falls short of trained models, we suggest adapting multi-modality inputs for LLMs to use as visual navigation agents and applying the explicit reasoning of LLMs to benefit learning-based models.
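The per-step decision described above (textual observation, navigation history, and explorable directions in; a reasoned action out) can be illustrated with a minimal sketch. This is not the authors' implementation: the names `StepInput`, `build_prompt`, `parse_action`, `navigate_step`, and `stub_llm`, the prompt wording, and the `Thought:`/`Action:` reply format are all assumptions made for illustration, and the LLM is replaced by a stub so the example runs without any external service.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class StepInput:
    """Text-only view of one navigation step (hypothetical structure)."""
    observation: str            # description of the current surroundings
    candidates: Dict[str, str]  # viewpoint id -> description of that direction


def build_prompt(instruction: str, history: List[str], step: StepInput) -> str:
    """Combine instruction, history, observation, and candidates into one prompt."""
    history_lines = [f"  {i + 1}. {h}" for i, h in enumerate(history)] or ["  (none)"]
    candidate_lines = [f"  {vid}: {desc}" for vid, desc in step.candidates.items()]
    return "\n".join(
        [f"Instruction: {instruction}", "History:"]
        + history_lines
        + [f"Observation: {step.observation}", "Explorable directions:"]
        + candidate_lines
        + ["Reply with 'Thought: ...' and then 'Action: <viewpoint id>'."]
    )


def parse_action(reply: str, candidates: Dict[str, str]) -> Optional[str]:
    """Return the chosen viewpoint id, or None if the reply names no valid candidate."""
    match = re.search(r"Action:\s*(\S+)", reply)
    if match and match.group(1) in candidates:
        return match.group(1)
    return None


def navigate_step(
    instruction: str,
    history: List[str],
    step: StepInput,
    llm: Callable[[str], str],
) -> Optional[str]:
    """Run one reasoning step and record it in the history if an action was chosen."""
    reply = llm(build_prompt(instruction, history, step))
    choice = parse_action(reply, step.candidates)
    if choice is not None:
        history.append(f"Saw: {step.observation} Moved to {choice}.")
    return choice


def stub_llm(prompt: str) -> str:
    """Stand-in for a real LLM call (assumption: replies use Thought/Action lines)."""
    return "Thought: The instruction mentions the kitchen.\nAction: vp_2"
```

In this sketch, `navigate_step` would be called once per step with a fresh `StepInput`, and the growing `history` list plays the role of the navigation history fed back into the next prompt. Returning `None` for an unparseable or invalid reply leaves the caller free to decide how to recover, for example by re-prompting.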
