MapGPT: Map-Guided Prompting with Adaptive Path Planning for Vision-and-Language Navigation

Jiaqi Chen, Bingqian Lin, Ran Xu, Zhenhua Chai, Xiaodan Liang, Kwan-Yee K. Wong

TL;DR

MapGPT addresses zero-shot VLN by equipping an agent with an online linguistic topological map embedded in prompts, enabling global exploration and planning. It introduces a single-expert prompting system, a topological map constructed through prompts, and an adaptive multi-step path planning module that works with GPT-4 and GPT-4V. The approach achieves state-of-the-art zero-shot performance on R2R and REVERIE, demonstrating emergent global thinking and planning capabilities in GPT-based agents. This method reduces reliance on annotated data and multi-expert prompting, offering a scalable path toward robust multimodal navigation in embodied AI.

Abstract

Embodied agents that use GPT as their brain have shown strong decision-making and generalization abilities across various tasks. However, existing zero-shot agents for vision-and-language navigation (VLN) only prompt GPT-4 to select potential locations within localized environments, without constructing an effective global view that lets the agent understand the overall environment. In this work, we present a novel map-guided GPT-based agent, dubbed MapGPT, which introduces an online map expressed in language to encourage global exploration. Specifically, we build the map online and incorporate it into the prompts, including node information and topological relationships, to help GPT understand the spatial environment. Building on this design, we further propose an adaptive planning mechanism that helps the agent perform multi-step path planning on the map, systematically exploring multiple candidate nodes or sub-goals step by step. Extensive experiments demonstrate that MapGPT is applicable to both GPT-4 and GPT-4V, achieving state-of-the-art zero-shot performance on both R2R and REVERIE (improvements of ~10% and ~12% in success rate (SR), respectively), and showcasing GPT's newly emergent global thinking and path planning abilities.
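
To make the idea of an online map written into the prompt more concrete, the following is a minimal sketch, not the authors' implementation. The class name `TopoMap`, its methods, and the exact wording of the serialized lines are assumptions chosen for illustration; the abstract only states that the prompt carries node information and topological relationships.

```python
from collections import defaultdict


class TopoMap:
    """Undirected graph of places discovered so far, grown one step at a time.

    Hypothetical helper for illustration; names and text format are assumptions.
    """

    def __init__(self):
        self.edges = defaultdict(set)  # place id -> neighbouring place ids
        self.visited = []              # trajectory, in visiting order

    def observe(self, current, neighbours):
        """Record arrival at `current` and the places reachable from it."""
        self.visited.append(current)
        self.edges[current]  # create the node even if it has no neighbours
        for n in neighbours:
            self.edges[current].add(n)
            self.edges[n].add(current)

    def unexplored(self):
        """Places that have been seen but never visited."""
        seen = set(self.visited)
        return sorted(p for p in self.edges if p not in seen)

    def to_prompt(self):
        """Serialise the connectivity of visited places as plain text."""
        lines = []
        for place in sorted(set(self.visited)):
            linked = ", ".join(str(n) for n in sorted(self.edges[place]))
            lines.append(f"Place {place} is connected with Places {linked}")
        return "\n".join(lines)
```

For example, after `observe(0, [1, 2])` and `observe(1, [0, 3])`, `to_prompt()` yields two lines describing Places 0 and 1, and `unexplored()` returns `[2, 3]`, the candidates a map-aware agent could still choose to explore.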

Paper Structure (32 sections, 2 equations, 6 figures, 4 tables)

Figures (6)

  • Figure 1: A comparison of the thinking process of the GPT agent without and with topological maps. Given only a local action space, the agent may explore aimlessly, especially when navigation errors have already occurred. Incorporating topological maps enables the agent to understand spatial structures and engage in global exploration and path planning.
  • Figure 2: Our basic system consists of two types of prompts, namely task description and fundamental inputs. We introduce a map-guided prompting method that builds an online-constructed topological map into prompts, activating the agent's global exploration. We further propose an adaptive mechanism to perform multi-step path planning on this map, systematically exploring candidate nodes or sub-goals. Note that vision models are optional, and viewpoint information can be represented using either the image or textual description of the observations.
  • Figure 3: A successful case on REVERIE showcases MapGPT's (GPT-4V based) various abilities, including global exploration (blue), map understanding (yellow), and adaptive multi-step path planning (green). The six images on the right represent six unexplored places at step 4. Among these, MapGPT focuses on four possible places and systematically explores them until it discovers the bathroom when moving to place 8.
  • Figure 4: Task description prompts for the R2R and REVERIE datasets. We make some simple yet necessary modifications to transfer MapGPT from the R2R task to REVERIE. This work focuses on unified navigation, while instructions in REVERIE often require some interactive actions on objects. Therefore, we require the agent to ignore these actions.
  • Figure 5: A successful example on the R2R dataset. We demonstrate some crucial steps that leverage map-guided global exploration and planning capabilities, ultimately resulting in successful navigation.
  • ...and 1 more figure
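
The Figure 2 caption describes per-step inputs that include the map, plus a multi-step plan that is revised as navigation proceeds. The sketch below shows one way such a loop could be wired together, under stated assumptions: the field labels (`Previous Planning`, `Action options`), the function names, and the `ask_model` callable are all hypothetical, and the model call is left abstract rather than tied to any real API.

```python
def build_prompt(instruction, history, map_text, previous_plan, options):
    """Assemble the per-step inputs into one text block (labels are assumptions)."""
    return "\n".join([
        f"Instruction: {instruction}",
        f"History: {' -> '.join(history) if history else 'none'}",
        f"Map:\n{map_text}",
        f"Previous Planning: {previous_plan or 'none'}",
        f"Action options: {', '.join(options)}",
    ])


def run_episode(instruction, env_steps, ask_model):
    """Run a navigation loop that feeds each step's plan into the next prompt.

    env_steps: iterable of (map_text, options) pairs, one per step.
    ask_model: hypothetical callable, prompt -> (new_plan, chosen_option).
    """
    history, plan = [], None
    for map_text, options in env_steps:
        prompt = build_prompt(instruction, history, map_text, plan, options)
        plan, action = ask_model(prompt)  # the plan may be revised at every step
        if action not in options:
            raise ValueError(f"model chose an unavailable option: {action!r}")
        history.append(action)
        if action == "stop":
            break
    return history, plan
```

Passing the previous plan back in at each step is what lets the plan adapt: the model sees its earlier multi-step intention alongside the updated map and can keep, extend, or replace it.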