Perceive, Reflect, and Plan: Designing LLM Agent for Goal-Directed City Navigation without Instructions

Qingbin Zeng, Qinglong Yang, Shunan Dong, Heming Du, Liang Zheng, Fengli Xu, Yong Li

TL;DR

The paper presents PReP, an agentic workflow for goal-directed city navigation without instructions, combining perception (landmark-based localization via a fine-tuned LLaVA), memory-driven reflection to build a cognitive map, and planning to produce long-horizon routes. It demonstrates that memory augmentation and structured planning substantially improve navigation over baselines, achieving a 54% average success rate across four major cities in a CBD-scale dataset. The approach is data-efficient, requiring limited training for perception and leveraging LLMs for memory synthesis and planning, offering a practical path toward autonomous urban navigation with reduced reliance on explicit instructions. Overall, PReP highlights the value of perception-grounded cognitive maps and reflection-informed planning for robust, long-range navigation in complex urban environments.

Abstract

This paper considers a scenario in city navigation: an AI agent is provided with language descriptions of the goal location with respect to some well-known landmarks; by only observing the scene around it, including recognizing landmarks and road network connections, the agent has to make decisions to navigate to the goal location without instructions. This problem is very challenging because it requires the agent to establish its self-position and acquire a spatial representation of a complex urban environment, where landmarks are often invisible. In the absence of navigation instructions, such abilities are vital for the agent to make high-quality decisions in long-range city navigation. With the emergent reasoning ability of large language models (LLMs), a tempting baseline is to prompt LLMs to "react" to each observation and make decisions accordingly. However, this baseline performs very poorly: the agent often revisits the same locations and makes short-sighted, inconsistent decisions. To address these issues, this paper introduces a novel agentic workflow characterized by its abilities to perceive, reflect and plan. Specifically, we find LLaVA-7B can be fine-tuned to perceive the direction and distance of landmarks with sufficient accuracy for city navigation. Moreover, reflection is achieved through a memory mechanism, where past experiences are stored and can be retrieved with current perception for effective decision argumentation. Planning uses reflection results to produce long-term plans, which can avoid short-sighted decisions in long-range navigation. We show the designed workflow significantly improves the navigation ability of the LLM agent compared with the state-of-the-art baselines.
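The contrast between a memoryless "react" baseline and a memory-equipped agent can be illustrated with a small toy. The sketch below is not the paper's implementation: PReP uses a fine-tuned LLaVA for perception and LLMs for reflection and planning, whereas here plain heuristics stand in for those components. The road network `EDGES`, the coordinates, and the function names `react_agent` and `memory_agent` are all hypothetical and chosen only so that a dead-end road points straight at the goal.

```python
import math
from collections import defaultdict

# Hypothetical toy road network: nodes are (x, y) intersections.
# The road (0,0)-(1,0)-(2,0) heads toward the goal but dead-ends at (2,0).
EDGES = [
    ((0, 0), (1, 0)), ((1, 0), (2, 0)),
    ((0, 0), (0, 1)), ((0, 1), (1, 1)), ((1, 1), (2, 1)),
    ((2, 1), (3, 1)), ((3, 1), (3, 0)),
]
START, GOAL = (0, 0), (3, 0)

GRAPH = defaultdict(list)
for a, b in EDGES:
    GRAPH[a].append(b)
    GRAPH[b].append(a)


def dist(a, b):
    """Straight-line distance between two intersections."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def react_agent(start, goal, max_steps=20):
    """Memoryless stand-in for the 'react' baseline:
    always step to the neighbor that is closest to the goal."""
    pos, path = start, [start]
    for _ in range(max_steps):
        if pos == goal:
            break
        pos = min(GRAPH[pos], key=lambda n: dist(n, goal))
        path.append(pos)
    return path


def memory_agent(start, goal, max_steps=20):
    """Heuristic stand-in for memory-based reflection and planning:
    prefer unvisited neighbors, and backtrack when none remain."""
    pos, path = start, [start]
    visited = {start}   # memory of intersections already seen
    trail = [start]     # stack of intersections used for backtracking
    for _ in range(max_steps):
        if pos == goal:
            break
        fresh = [n for n in GRAPH[pos] if n not in visited]
        if fresh:
            pos = min(fresh, key=lambda n: dist(n, goal))
            visited.add(pos)
            trail.append(pos)
        elif len(trail) > 1:
            trail.pop()          # dead end: take a detour via the previous node
            pos = trail[-1]
        else:
            break                # nothing left to explore
        path.append(pos)
    return path


if __name__ == "__main__":
    print("react :", react_agent(START, GOAL))
    print("memory:", memory_agent(START, GOAL))
```

In this toy, the memoryless agent oscillates between (1, 0) and (2, 0) until its step budget runs out, while the agent with memory backs out of the dead end and reaches the goal by the detour through (0, 1).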

Paper Structure (34 sections, 7 figures, 5 tables)


Figures (7)

  • Figure 1: An illustrative comparison of city navigation results. The proposed workflow method (blue) successfully reaches the goal, and its path is close to the shortest path (yellow). The React method (without workflow) fails because it makes short-sighted decisions. In one scenario, the React agent hits a dead end and keeps moving toward it because the goal is in that direction. In another scenario, the agent moves in circles because the goal's direction changes as it moves. The React agent has no memory, so it cannot take detours.
  • Figure 2: Task example and dataset regions. A task example is shown in (a). The instruction to the agent is the location of the goal relative to the landmarks in the city environment. The agent perceives the street views and recognizes the landmarks. The agent then has to infer the goal position relative to its current location from its observations of landmarks and move through the urban space. The road networks are from chosen CBD areas in Beijing (b), Shanghai (c), New York (d) and Paris (e). Blue points represent the landmarks, while red lines are roads.
  • Figure 3: Overview of the PReP workflow. It has three steps: perception, reflection, and planning. Blue boxes represent LLMs or LLaVA, while gray boxes indicate variables stored as natural language. Symbols are defined in Section \ref{section:env}.
  • Figure 4: Sample prompts and responses in the PReP workflow. In perception, a vision language model locates the landmarks and estimates their distances to the agent. In reflection, the agent reflects on past memory and gives an estimate of the direction of the goal. In planning, the agent uses the output from reflection to update the plan. Prompts have been simplified while retaining their original meaning. The full prompts are provided in Appendix \ref{appendix:prompt}.
  • Figure 5: Performance of PReP across varying task difficulties
  • ...and 2 more figures
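
The Figure 2 caption describes the agent inferring the goal position relative to itself from a landmark observation plus the goal's stated position relative to that landmark. In PReP this inference is carried out by LLMs over natural-language descriptions; the sketch below only shows the underlying geometry as vector addition. It assumes, for illustration, that directions are compass bearings in degrees (0 = north, 90 = east) and distances are in meters. The function names and the example numbers are hypothetical.

```python
import math


def polar_to_xy(bearing_deg, distance):
    """Convert a compass bearing and distance to an (east, north) offset."""
    rad = math.radians(bearing_deg)
    return distance * math.sin(rad), distance * math.cos(rad)


def xy_to_polar(east, north):
    """Convert an (east, north) offset back to (bearing in degrees, distance)."""
    bearing = math.degrees(math.atan2(east, north)) % 360
    return bearing, math.hypot(east, north)


def goal_from_landmark(agent_to_landmark, landmark_to_goal):
    """Add the observed agent->landmark offset to the instructed
    landmark->goal offset to obtain the agent->goal offset."""
    le, ln = polar_to_xy(*agent_to_landmark)
    ge, gn = polar_to_xy(*landmark_to_goal)
    return xy_to_polar(le + ge, ln + gn)


if __name__ == "__main__":
    # Assumed example: the landmark is seen 300 m due north of the agent,
    # and the instruction places the goal 400 m due east of the landmark.
    bearing, distance = goal_from_landmark((0.0, 300.0), (90.0, 400.0))
    print(f"goal is about {distance:.0f} m away at bearing {bearing:.1f} degrees")
```

With these assumed numbers the offsets form a 3-4-5 right triangle, so the goal lies about 500 m away, roughly to the northeast of the agent. When a landmark is not visible, no such offset is available from the current view, which is one reason the workflow relies on memory of past observations.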