Global Commander and Local Operative: A Dual-Agent Framework for Scene Navigation

Kaiming Jin, Yuefan Wu, Shengqiong Wu, Bobo Li, Shuicheng Yan, Tat-Seng Chua

TL;DR

By disentangling global reasoning from local action, DACo alleviates cognitive overload and improves long-horizon stability, providing a principled and extensible paradigm for robust long-horizon navigation.

Abstract

Vision-and-language scene navigation is a fundamental capability for embodied human-AI collaboration, requiring agents to follow natural language instructions and execute coherent action sequences in complex environments. Existing approaches either rely on multiple agents, incurring high coordination and resource costs, or adopt a single-agent paradigm, which overloads the agent with both global planning and local perception and often leads to degraded reasoning and instruction drift in long-horizon settings. To address these issues, we introduce DACo, a planning-grounding decoupled architecture that disentangles global deliberation from local grounding. Concretely, it employs a Global Commander for high-level strategic planning and a Local Operative for egocentric observation and fine-grained execution. By disentangling global reasoning from local action, DACo alleviates cognitive overload and improves long-horizon stability. The framework further integrates dynamic subgoal planning and adaptive replanning to enable structured and resilient navigation. Extensive evaluations on R2R, REVERIE, and R4R demonstrate that DACo achieves absolute improvements of 4.9%, 6.5%, and 5.4%, respectively, over the best-performing baselines in zero-shot settings, and generalizes effectively across both closed-source (e.g., GPT-4o) and open-source (e.g., Qwen-VL series) backbones. DACo provides a principled and extensible paradigm for robust long-horizon navigation. Project page: https://github.com/ChocoWu/DACo

Paper Structure (27 sections, 8 equations, 9 figures, 7 tables, 1 algorithm)

Figures (9)

  • Figure 1: Comparison of different agentic systems for scene navigation. Multi-agent systems rely on multiple experts, resulting in high coordination overhead and resource costs. Single-agent systems must handle both global planning and local perception, overloading the decision process. In contrast, our dual-agent framework assigns clear and complementary roles to global and local agents, simplifying system design and enabling more robust navigation reasoning.
  • Figure 2: Overview of the DACo framework. Our system comprises two collaborative components: (1) Global Agent: Acting as a high-level strategic planner, the Global Agent maintains a panoramic perspective by integrating current location descriptions, historical trajectories, and top-down maps. It iteratively generates a dynamic global plan to guide the navigation. (2) Local Agent: Serving as the low-level executor, the Local Agent begins each time step by synthesizing local observations into a concise environment description and issuing a planning or re-planning request. Upon receiving the global guidance, it grounds the high-level plan into primitive navigation actions to interact with the environment.
  • Figure 3: Illustration of the self-correction capability of the DACo system. Our framework achieves robust self-correction through two complementary processes: (1) Dynamic Planning: The Global Agent monitors the Local Agent's trajectory via a top-down view; upon detecting a deviation, it adaptively refines the global plan to rectify the path. (2) Replan Mechanism: The Local Agent cross-checks the global plan's validity against its immediate visual observations. If an inconsistency is detected (e.g., a missing landmark), a re-planning cycle is proactively triggered to resolve the discrepancy.
  • Figure 4: Impact of Agent Backbone. All methods are implemented using Qwen2.5-VL-32B, Qwen3-VL-8B, and GPT-4o. The zero success rate of NavGPT on Qwen2.5-VL is primarily due to catastrophic format parsing failures.
  • Figure 5: Results for samples grouped by number of steps. MapGPT is reproduced with GPT-4o.
  • ...and 4 more figures
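
The interaction described in the captions of Figures 2 and 3 can be illustrated with a toy control loop. The sketch below is not the authors' implementation: the class and function names (`GlobalCommander`, `LocalOperative`, `navigate`, `shortest_path`), the room graph, and the use of breadth-first search in place of an MLLM planner are all assumptions made for illustration. It only shows the division of labor: a global planner that holds a (possibly outdated) top-down map and regenerates its plan at every step, and a local executor that sees only its immediate surroundings and requests a replan when the next planned landmark is not actually visible.

```python
from collections import deque


def shortest_path(graph, start, goal):
    """Breadth-first search; returns a list of nodes, or None if unreachable."""
    queue, parent = deque([start]), {start: None}
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in sorted(graph.get(node, ())):
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None


class GlobalCommander:
    """Hypothetical global planner: BFS stands in for MLLM reasoning."""

    def __init__(self, believed_map):
        # Private copy of the top-down map; it may disagree with reality.
        self.map = {k: set(v) for k, v in believed_map.items()}

    def plan(self, position, goal, blocked=None):
        # On a replan request, drop the edge the local agent could not ground.
        if blocked is not None:
            src, dst = blocked
            self.map[src].discard(dst)
        return shortest_path(self.map, position, goal)


class LocalOperative:
    """Hypothetical local executor: sees only landmarks adjacent to it."""

    def __init__(self, true_env):
        self.env = true_env

    def observe(self, position):
        return self.env[position]

    def next_step_valid(self, plan, position):
        # Cross-check the plan's next landmark against what is visible.
        return len(plan) > 1 and plan[1] in self.observe(position)


def navigate(commander, operative, start, goal, max_steps=20):
    position, trajectory, replans = start, [start], 0
    plan = commander.plan(position, goal)
    for _ in range(max_steps):
        if position == goal or plan is None:
            break
        if not operative.next_step_valid(plan, position):
            # Replan mechanism: report the missing landmark to the planner.
            plan = commander.plan(position, goal, blocked=(position, plan[1]))
            replans += 1
            continue
        position = plan[1]  # primitive action: move to the next landmark
        trajectory.append(position)
        plan = commander.plan(position, goal)  # dynamic planning every step
    return trajectory, replans


# Toy scene: the planner believes hall connects to kitchen, but it does not.
believed_map = {
    "hall": {"kitchen", "stairs"}, "kitchen": {"bedroom"},
    "stairs": {"landing"}, "landing": {"bedroom"}, "bedroom": set(),
}
true_env = dict(believed_map, hall={"stairs"})

trajectory, replans = navigate(
    GlobalCommander(believed_map), LocalOperative(true_env), "hall", "bedroom"
)
print(trajectory, replans)
```

In this toy scene the first plan routes through the kitchen, the local check fails because the kitchen is not visible from the hall, and a single replan reroutes the agent via the stairs. Keeping the map inside the planner and the observations inside the executor mirrors the separation of global and local roles, although a real system would replace both stubs with model calls.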