NavGPT-2: Unleashing Navigational Reasoning Capability for Large Vision-Language Models

Gengze Zhou, Yicong Hong, Zun Wang, Xin Eric Wang, Qi Wu

TL;DR

This work strives to bridge the divide between VLN-specialized models and LLM-based navigation paradigms by aligning visual content in a frozen LLM, while preserving the LLM's ability to generate linguistic navigational reasoning.

Abstract

Capitalizing on the remarkable advancements in Large Language Models (LLMs), there is a burgeoning initiative to harness LLMs for instruction-following robotic navigation. Such a trend underscores the potential of LLMs to generalize navigational reasoning and diverse language understanding. However, a significant discrepancy in agent performance is observed when integrating LLMs in Vision-and-Language Navigation (VLN) tasks compared to previous downstream specialist models. Furthermore, the inherent capacity of language to interpret and facilitate communication in agent interactions is often underutilized in these integrations. In this work, we strive to bridge the divide between VLN-specialized models and LLM-based navigation paradigms, while maintaining the interpretative prowess of LLMs in generating linguistic navigational reasoning. By aligning visual content in a frozen LLM, we encompass visual observation comprehension for LLMs and exploit a way to incorporate LLMs and navigation policy networks for effective action predictions and navigational reasoning. We demonstrate the data efficiency of the proposed methods and eliminate the gap between LM-based agents and state-of-the-art VLN specialists.

Paper Structure

This paper contains 37 sections, 6 equations, 6 figures, 7 tables.

Figures (6)

  • Figure 1: Left: Besides performing effective navigation planning, NavGPT-2 is capable of generating navigational reasoning in a human-interpretable way. Right: NavGPT-2 can support multi-round interaction with the user and plan according to the user’s intervention in the navigation process, actively ask for help, and answer visual questions.
  • Figure 2: Model architecture of NavGPT-2, which consists of a multimodal Large Language Model and a topological graph-based navigation policy network. The yellow blocks indicate the modules trained in stage one, the red blocks indicate the modules trained in stage two, and the blue blocks are frozen.
  • Figure 3: Navigation system prompt for NavGPT-2.
  • Figure 4: Data generation pipeline and visual instruction tuning on navigation reasoning data. $\{\mathcal{I}, \mathcal{O}\}$ denotes the instruction-observation pairs on the R2R trajectories, $\mathcal{R}$ is the reasoning generated by GPT-4V, and $\mathcal{R}'$ is the reasoning generated by NavGPT-2.
  • Figure 5: Navigation reasoning generation prompt for GPT-4V.
  • ...and 1 more figure